Nova Resource:Tools/SAL
=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]
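
The wm-bot2 entries above were produced by a node-removal cookbook; its internals are not recorded in this log. A rough sketch of the manual gridengine equivalent, assuming standard SGE tooling on the grid master:
<syntaxhighlight lang="bash">
# Sketch only: the cookbook's actual steps are an assumption.
HOST=tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud

qmod -d "*@${HOST}"     # disable all queue instances on the node so nothing new lands there
qhost -j -h "${HOST}"   # check for jobs still running on it
qconf -de "${HOST}"     # delete the execution host from the grid configuration
</syntaxhighlight>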


=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]
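
The "temp add 1.0G swap" entry does not record the commands used; a minimal sketch of adding temporary swap on a Linux host, with an illustrative /swapfile path:
<syntaxhighlight lang="bash">
# Minimal sketch of adding 1.0G of temporary swap (path and size are illustrative).
sudo fallocate -l 1G /swapfile   # reserve the space (dd also works where fallocate is unsupported)
sudo chmod 600 /swapfile         # swap files must not be world-readable
sudo mkswap /swapfile            # write swap metadata
sudo swapon /swapfile            # enable it; verify with: swapon --show
</syntaxhighlight>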


=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
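
The dcaro entries refresh several internal TLS certificates. The renewal tooling is not shown in the log; a generic way to confirm that a certificate really was refreshed is to read its dates over the wire (hostname illustrative):
<syntaxhighlight lang="bash">
# Sketch: verify a TLS certificate's validity window after a refresh (hostname is an example).
HOST=tools-prometheus.wmflabs.org
echo | openssl s_client -connect "${HOST}:443" 2>/dev/null \
  | openssl x509 -noout -subject -dates
</syntaxhighlight>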


=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]


=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation


=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko


=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko


=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940


=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko
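
The 09:15 build & push entry is a cookbook run; its effect is roughly a docker build against the Gerrit checkout followed by a push to the internal registry. A sketch, assuming the repo carries its own Dockerfile (the cookbook's exact steps and build flags are assumptions):
<syntaxhighlight lang="bash">
# Sketch of what the build & push cookbook effectively does.
git clone https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api
cd jobs-framework-api
docker build -t docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest .
docker push docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest
</syntaxhighlight>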


=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])


=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]


=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])


=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82


=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])


=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko


=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])


=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)


=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /


=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component


=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7


=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo


=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo


=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo ([[phab:T297090|T297090]])


=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand


=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /


=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access


=== 2022-02-16 ===
* 00:12 bd808: Image builds completed.


=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]
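
The 08:06/08:16/08:45 entries toggle Puppet project-wide. The log does not say how; run from a cumin host, it might look like this sketch (Cumin's OpenStack query grammar and the disable-puppet/enable-puppet helper scripts are assumed to be available in this setup):
<syntaxhighlight lang="bash">
# Sketch: fleet-wide Puppet toggle; the exact commands used are not recorded in this log.
sudo cumin 'O{project:tools}' 'disable-puppet "enabling puppetdb - T214427"'
# ... apply the puppetdb hiera changes ...
sudo cumin 'O{project:tools}' 'enable-puppet "enabling puppetdb - T214427"'
</syntaxhighlight>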


=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]


=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes


=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate


=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]


=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])


=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4


=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2022-01-19 ===
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move


=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]


=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]

=== 2021-12-14 ===
* 09:46 majavah: testing delete-crashing-pods emailer component with a test tool [[phab:T292925|T292925]]

=== 2021-12-08 ===
* 05:21 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1028

=== 2021-12-07 ===
* 11:11 arturo: updated member roles in github.com/toolforge: remove brooke as owner, add dcaro

=== 2021-12-06 ===
* 13:23 majavah: root@toolserver-proxy-01:~# systemctl restart apache2.service # working around [[phab:T293826|T293826]]

=== 2021-12-04 ===
* 12:18 majavah: deploying delete-crashing-pods in dry run mode [[phab:T292925|T292925]]

=== 2021-11-28 ===
* 17:46 andrewbogott: moving tools-k8s-etcd-13 to cloudvirt1020; cloudvirt1018 (its old host) has a degraded raid which is affecting performance

=== 2021-11-19 ===
* 13:16 majavah: manually add 3 project members after ldap issues were fixed

=== 2021-11-16 ===
* 12:31 majavah: uploading calico 3.21.0 to the internal docker registry [[phab:T292698|T292698]]
* 10:28 majavah: deploying maintain-kubeusers changes [[phab:T286857|T286857]]

=== 2021-11-11 ===
* 10:50 arturo: add user `srv-networktests` as project user ([[phab:T294955|T294955]])

=== 2021-11-05 ===
* 19:18 majavah: deploying registry-admission changes

=== 2021-10-29 ===
* 23:58 andrewbogott: deleting all files older than 14 days in /srv/tools/shared/tools/project/.shared/cache

=== 2021-10-28 ===
* 12:42 arturo: set `allow-snippet-annotations: "false"` for ingress-nginx ([[phab:T294330|T294330]])

=== 2021-10-26 ===
* 18:00 majavah: deleting legacy ingresses for tools.wmflabs.org urls
* 12:26 majavah: deploy ingress-admission updates
* 12:11 majavah: deploy ingress-nginx v1.0.4 / chart v4.0.6 on toolforge [[phab:T292771|T292771]]

=== 2021-10-25 ===
* 14:33 majavah: copy nginx-ingress controller v1.0.4 to internal registry [[phab:T292771|T292771]]
* 11:32 majavah: depool tools-sgeexec-0910 [[phab:T294228|T294228]]
* 11:13 majavah: removed tons of duplicate qw jobs across multiple tools

=== 2021-10-22 ===
* 15:35 majavah: remove "^tools-k8s-master-[0-9]+\.tools\.eqiad\.wmflabs$" from authorized_regexes for the main certificate
* 15:35 majavah: add mail.tools.wmcloud.org to the tools mail tls certificate alternative names

=== 2021-10-21 ===
* 09:48 majavah: deploying toolforge-webservice 0.79

=== 2021-10-20 ===
* 15:41 majavah: removing toollabs-webservice from grid exec and master nodes where it's not needed and not managed by puppet
* 12:51 majavah: rolling out toolforge-webservice 0.78 [[phab:T292706|T292706]] [[phab:T282975|T282975]] [[phab:T276626|T276626]]

=== 2021-10-15 ===
* 15:01 arturo: add updated ingress-nginx docker image in the registry (v1.0.1) for [[phab:T293472|T293472]]

=== 2021-10-07 ===
* 09:13 majavah: disabling settings api, now that all pod presets are gone [[phab:T279106|T279106]]
* 08:00 majavah: removing all pod presets [[phab:T279106|T279106]]
* 05:44 majavah: deploying fix for [[phab:T292672|T292672]]

=== 2021-10-06 ===
* 06:46 majavah: taavi@toolserver-proxy-01:~$ sudo systemctl restart apache2.service # see if it helps with toolserver.org ssl alerts

=== 2021-10-03 ===
* 21:31 bstorm: rebuilding buster containers since they are also affected [[phab:T291387|T291387]] [[phab:T292355|T292355]]
* 21:29 bstorm: rebuilt stretch containers for potential issues with LE cert updates [[phab:T291387|T291387]]

=== 2021-10-01 ===
* 21:59 bd808: clush -w @all -b 'sudo sed -i "s#mozilla/DST_Root_CA_X3.crt#!mozilla/DST_Root_CA_X3.crt#" /etc/ca-certificates.conf && sudo update-ca-certificates' for [[phab:T292289|T292289]]

=== 2021-09-30 ===
* 13:43 majavah: cleaning up unused kubernetes ingress objects for tools.wmflabs.org urls [[phab:T292105|T292105]]

=== 2021-09-29 ===
* 22:39 bstorm: finished deploy of the toollabs-webservice 0.77 and updating labels across the k8s cluster to match
* 22:26 bstorm: pushing toollabs-webservice 0.77 to tools releases
* 21:46 bstorm: pushing toollabs-webservice 0.77 to toolsbeta

=== 2021-09-27 ===
* 16:19 majavah: deploy volume-admission fix for containers with some volumes mounted
* 13:01 majavah: publish jobutils and misctools 0.43 [[phab:T286072|T286072]]
* 11:34 majavah: disabling pod preset controller [[phab:T279106|T279106]]

=== 2021-09-23 ===
* 17:20 majavah: deploying new maintain-kubeusers for lack of podpresets [[phab:T279106|T279106]]

=== 2021-09-22 ===
* 18:06 bstorm: launching tools-nfs-test-client-01 to run a "fair" test battery against [[phab:T291406|T291406]]
* 11:37 dcaro: controlled undrain tools-k8s-worker-53 ([[phab:T291546|T291546]])
* 08:57 majavah: drain tools-k8s-worker-53

=== 2021-09-20 ===
* 12:44 majavah: deploying volume-admission to tools, should not affect anything yet [[phab:T279106|T279106]]

=== 2021-09-15 ===
* 08:08 majavah: update tools-manifest to 0.24

=== 2021-09-14 ===
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)
* 10:36 arturo: add toolforge-jobs-framework-cli v5 to aptly buster-tools/toolsbeta

=== 2021-09-13 ===
* 08:57 arturo: cleared grid queues error states ([[phab:T290844|T290844]])
* 08:55 arturo: repooling sgeexec-0907 ([[phab:T290798|T290798]])
* 08:14 arturo: rebooting sgeexec-0907 ([[phab:T290798|T290798]])
* 08:12 arturo: depool sgeexec-0907 ([[phab:T290798|T290798]])

=== 2021-09-11 ===
* 08:51 majavah: depool tools-sgeexec-0907

=== 2021-09-10 ===
* 23:26 bstorm: cleared error state for tools-sgeexec-0907.tools.eqiad.wmflabs
* 12:00 arturo: shutdown tools-package-builder-03 (buster), leave -04 online (bullseye)
* 09:35 arturo: live-hacking tools puppetmaster with a couple of ops/puppet changes
* 07:54 arturo: created bullseye VM tools-package-builder-04 ([[phab:T273942|T273942]])

=== 2021-09-09 ===
* 16:20 arturo: {{Gerrit|70017ec0ac}} root@tools-k8s-control-3:~# kubectl apply -f /etc/kubernetes/psp/base-pod-security-policies.yaml

=== 2021-09-07 ===
* 15:27 majavah: rolling out python3-prometheus-client updates
* 14:41 majavah: manually removing some absented but still present crontabs to stop root@ spam

=== 2021-09-06 ===
* 16:31 arturo: deploying jobs-framework-cli v4
* 16:22 arturo: deploying jobs-framework-api {{Gerrit|3228d97}}

=== 2021-09-03 ===
* 22:36 bstorm: backfilling quotas in screen for [[phab:T286784|T286784]]
* 12:49 majavah: deploying new tools-manifest version

=== 2021-09-02 ===
* 01:02 bstorm: deployed new version of maintain-kubeusers with new count quotas for new tools [[phab:T286784|T286784]]

=== 2021-08-20 ===
* 19:10 majavah: rebuilding node12-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> to use debian packaged npm 7
* 18:42 majavah: rebuilding php74-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> to use composer 2

=== 2021-08-18 ===
* 21:32 bstorm: rebooted tools-sgecron-01 due to RAM filling up and killing everything
* 16:34 bstorm: deleting the sssd cache on tools-sgecron-01 to fix a peculiar passwd db issue

=== 2021-08-16 ===
* 17:00 majavah: remove and re-add toollabs-webservice 0.75 on stretch-toolsbeta repository
* 15:45 majavah: reset sul account mapping on striker for developer account "DutchTom" [[phab:T288969|T288969]]
* 14:19 majavah: building node12 images - [[phab:T284590|T284590]] [[phab:T243159|T243159]]

=== 2021-08-15 ===
* 17:30 majavah: deploying updated jobs-framework-api container list to include bullseye images
* 17:22 majavah: finished initial build of images: php74, jdk17, python39, ruby27 - [[phab:T284590|T284590]]
* 16:51 majavah: starting build of initial bullseye based images - [[phab:T284590|T284590]]
* 16:44 majavah: tagged and building toollabs-webservice 0.76 with bullseye images defined [[phab:T284590|T284590]]
* 15:14 majavah: building tools-webservice 0.74 (currently live version) to bullseye-tools and bullseye-toolsbeta

=== 2021-08-12 ===
* 16:59 bstorm: deployed updated manifest for ingress-admission
* 16:45 bstorm: restarted ingress admission pods in tools after testing in toolsbeta
* 16:27 bstorm: updated the docker image for docker-registry.tools.wmflabs.org/ingress-admission:latest
* 16:22 bstorm: rebooting tools-docker-registry-05 after exchanging uids for puppet and docker-registry

=== 2021-08-07 ===
* 05:59 majavah: restart nginx on toolserver-proxy-01 to see if that helps with the flapping icinga certificate expiry check

=== 2021-08-06 ===
* 16:17 bstorm: failed over to tools-docker-registry-06 (which has more space) [[phab:T288229|T288229]]
* 00:43 bstorm: set up sync between the new registry host and the existing one [[phab:T288229|T288229]]
* 00:21 bstorm: provisioning second docker registry server to rsync to (120GB disk and fairly large server) [[phab:T288229|T288229]]

=== 2021-08-05 ===
* 23:50 bstorm: rebooting the docker registry [[phab:T288229|T288229]]
* 23:04 bstorm: extended docker registry volume to 120GB [[phab:T288229|T288229]]

=== 2021-07-29 ===
* 18:04 majavah: reset sul account mapping on striker for developer account "Derek Zax" [[phab:T287369|T287369]]

=== 2021-07-28 ===
* 21:33 majavah: add mdipietro as projectadmin and to sudo policy [[phab:T287287|T287287]]

=== 2021-07-27 ===
* 16:20 bstorm: built new php images with python2 on board [[phab:T287421|T287421]]
* 00:04 bstorm: deploy a version of the php7.3 web image that includes the python2 package with tag :testing [[phab:T287421|T287421]]

=== 2021-07-26 ===
* 17:37 bstorm: repooled the whole set of ingress workers after upgrades [[phab:T280340|T280340]]
* 16:37 bstorm: removing tools-k8s-ingress-4 from active ingress nodes at the proxy [[phab:T280340|T280340]]

=== 2021-07-23 ===
* 07:15 majavah: restart nginx on tools-static-14 to see if it helps with fontcdn issues

=== 2021-07-22 ===
* 23:35 bstorm: deleted tools-sgebastion-09 since it has been shut off since March anyway
* 15:32 arturo: re-deploying toolforge-jobs-framework-api
* 15:30 arturo: pushed new docker image on the registry for toolforge-jobs-framework-api {{Gerrit|4d8235b879adbac9122a968b4335cf2bafee2b61}} ([[phab:T287077|T287077]])
 
=== 2021-07-21 ===
* 20:01 bstorm: deployed new maintain-kubeusers to toolforge [[phab:T285011|T285011]]
* 19:55 bstorm: deployed new rbac for maintain-kubeusers changes [[phab:T285011|T285011]]
* 17:10 majavah: deploying calico v3.18.4 [[phab:T280342|T280342]]
* 14:35 majavah: updating systemd on toolforge stretch bastions [[phab:T287036|T287036]]
* 11:59 arturo: deploying jobs-framework-api {{Gerrit|07346d715d17585db9c16dd152cc91ef0bea33c3}} ([[phab:T286108|T286108]])
* 11:04 arturo: enabling TTLAfterFinished feature gate on kubeadm live configmap ([[phab:T286108|T286108]])
* 11:01 arturo: enabling TTLAfterFinished feature gate on static pod manifests on /etc/kubernetes/manifests/kube-<nowiki>{</nowiki>apiserver,controller-manager<nowiki>}</nowiki>.yaml in all 3 control nodes ([[phab:T286108|T286108]])
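
Enabling a feature gate on the static pod manifests (the 11:01 entry) means editing the flag list under each control node's /etc/kubernetes/manifests; kubelet restarts the static pods automatically when the files change. A sketch:
<syntaxhighlight lang="bash">
# Sketch: enable the TTLAfterFinished feature gate on one control node's static pod manifests.
cd /etc/kubernetes/manifests
grep -n 'feature-gates' kube-apiserver.yaml kube-controller-manager.yaml
# Add (or extend) the flag in each container's command list, e.g.:
#   - --feature-gates=TTLAfterFinished=true
# kubelet watches this directory and restarts the static pod when the file changes.
</syntaxhighlight>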
 
=== 2021-07-20 ===
* 18:42 majavah: deploying systemd security tools on toolforge public stretch machines [[phab:T287004|T287004]]
* 17:45 arturo: pushed new toolforge-jobs-framework-api docker image into the registry ({{Gerrit|3a6ae38d51202c5c765c8d800cb8380e2a20b998}}) ([[phab:T286126|T286126]])
* 17:37 arturo: added toolforge-jobs-framework-cli v3 to aptly buster-tools and buster-toolsbeta
* 13:25 majavah: apply buster systemd security updates
 
=== 2021-07-19 ===
* 23:24 bstorm: applied matchPolicy: equivalent to tools ingress validation controller [[phab:T280360|T280360]]
* 16:43 bstorm: cleared queue error state caused by excessive resource use by topicmatcher [[phab:T282474|T282474]]
 
=== 2021-07-16 ===
* 14:04 arturo: deployed jobs-framework-api {{Gerrit|42b7a885a5bc1bf00c300e8d77bd92e1430a8327}} ([[phab:T286132|T286132]])
* 11:57 arturo: added toollabs-webservice_0.75_all to jessie-tools aptly repo ([[phab:T286003|T286003]])
* 11:52 arturo: created `jessie-tools` aptly repository on tools-services-05 ([[phab:T286003|T286003]])
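
The 11:52/11:57 entries create an aptly repository and add a package to it; the usual aptly flow looks like this sketch (the distribution name and the signing setup here are assumptions):
<syntaxhighlight lang="bash">
# Sketch of the aptly create/add/publish flow behind these two entries.
aptly repo create jessie-tools                                  # create the local repo
aptly repo add jessie-tools toollabs-webservice_0.75_all.deb    # add the package to it
aptly publish repo -distribution=jessie-tools -skip-signing jessie-tools   # publish it
</syntaxhighlight>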
 
=== 2021-07-15 ===
* 16:12 arturo: deploy toolforge-jobs-framework-api git version {{Gerrit|d85d93ee1c5d4be6a526cf83e806b2679dde3875}} ([[phab:T285944|T285944]], [[phab:T286107|T286107]], [[phab:T285979|T285979]], [[phab:T286485|T286485]])
* 15:55 arturo: added toolforge-jobs-framework-cli_2_all.deb to buster-<nowiki>{</nowiki>tools,toolsbeta<nowiki>}</nowiki> ([[phab:T285944|T285944]])
 
=== 2021-07-14 ===
* 23:29 bstorm: mounted nfs on tools-services-05 and backing up aptly to NFS dir [[phab:T286003|T286003]]
* 09:17 majavah: copying calico 3.18.4 images from docker hub to docker-registry.tools.wmflabs.org [[phab:T280342|T280342]]
 
=== 2021-07-12 ===
* 16:56 bstorm: deleted job {{Gerrit|4720371}} due to LDAP failure
* 16:51 bstorm: cleared the E state from two job queues
 
=== 2021-07-02 ===
* 18:46 bstorm: cleared error state for tools-sgeexec-0940.tools.eqiad.wmflabs
 
=== 2021-07-01 ===
* 22:08 bstorm: releasing webservice 0.75
* 17:03 andrewbogott: rebooting tools-k8s-worker-[31,33,35,44,49,51,57-58,70].tools.eqiad1.wikimedia.cloud
* 16:47 bstorm: remounted scratch everywhere...but mostly tools [[phab:T224747|T224747]]
* 15:47 arturo: rebased labs/private.git
* 11:04 arturo: added toolforge-jobs-framework-cli_1_all.deb to aptly buster-tools,buster-toolsbeta
* 10:34 arturo: refreshed jobs-api deployment
 
=== 2021-06-29 ===
* 21:58 bstorm: clearing one errored queue and a stack of discarded jobs
* 20:11 majavah: toolforge kubernetes upgrade complete [[phab:T280299|T280299]]
* 17:03 majavah: starting toolforge kubernetes 1.18 upgrade - [[phab:T280299|T280299]]
* 16:17 arturo: deployed jobs-framework-api in the k8s cluster
* 15:34 majavah: remove duplicate definitions from tools-clushmaster-02 /root/.ssh/known_hosts
* 15:12 arturo: livehacking puppetmaster for [[phab:T283238|T283238]]
* 10:24 dcaro: running puppet on the buster bastions after 20000 minutes failing... might break something
 
=== 2021-06-15 ===
* 19:02 bstorm: cleared error status from a few queues
* 16:15 majavah: deleting unused shutdown nodes: tools-checker-03 tools-k8s-haproxy-1 tools-k8s-haproxy-2
 
=== 2021-06-14 ===
* 22:21 bstorm: push docker-registry.tools.wmflabs.org/toolforge-python37-sssd-web:testing to test staged os.execv (and other patches) using toolsbeta toollabs-webservice version 0.75 [[phab:T282975|T282975]]
 
=== 2021-06-13 ===
* 08:15 majavah: clear grid error state from tools-sgeexec-0907, tools-sgeexec-0916, tools-sgeexec-0940
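
Entries like this recur throughout the log; a sketch of finding and clearing gridengine error states (the node name is taken from this entry):
<syntaxhighlight lang="bash">
# Sketch: locate and clear gridengine error states.
qstat -f -qs E                 # list queue instances currently in error (E) state
qstat -f -qs E -explain E      # same, with the reason for each error
qmod -cq '*@tools-sgeexec-0907.tools.eqiad.wmflabs'   # clear the error state on that node's queues
qmod -cj 12345                 # or clear a single job's error state (job id illustrative)
</syntaxhighlight>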
 
=== 2021-06-12 ===
* 14:39 majavah: remove nonexistent tools-prometheus-04 and add tools-prometheus-05 to hiera key "prometheus_nodes"
* 13:53 majavah: create empty bullseye-<nowiki>{</nowiki>tools,toolsbeta<nowiki>}</nowiki> repositories on tools-services-05 aptly
 
=== 2021-06-10 ===
* 17:38 majavah: clear error state from tools-sgeexec-0907, task@tools-sgeexec-0939
 
=== 2021-06-09 ===
* 13:57 majavah: clear error state from exec nodes tools-sgeexec-0913, tools-sgeexec-0936, task@tools-sgeexec-0940
 
=== 2021-06-07 ===
* 18:39 bstorm: cleaning up more error conditions on grid queues
* 17:42 majavah: delete `ingress-nginx` namespace and related objects [[phab:T264221|T264221]]
* 17:37 majavah: remove tools-k8s-ingress-[1-3] from kubernetes, follow-up to https://sal.toolforge.org/log/nd7v2HkB1jz_IcWuCX5M [[phab:T264221|T264221]]
 
=== 2021-06-04 ===
* 21:30 bstorm: deleting "tools-k8s-ingress-3", "tools-k8s-ingress-2", "tools-k8s-ingress-1" [[phab:T264221|T264221]]
* 21:21 bstorm: cleared error state from 4 grid queues
 
=== 2021-06-03 ===
* 18:27 majavah: renew prometheus kubernetes certificate [[phab:T280301|T280301]]
* 17:06 majavah: renew admission webhook certificates [[phab:T280301|T280301]]
 
=== 2021-06-01 ===
* 10:10 majavah: properly clean up deleted vms tools-k8s-haproxy-[1,2], tools-checker-03 from puppet after using the wrong fqdn the first time
* 09:54 majavah: clear error state from tools-sgeexec-0913, tools-sgeexec-0950
 
=== 2021-05-30 ===
* 18:58 majavah: clear grid error state from 14 queues
 
=== 2021-05-27 ===
* 18:03 bstorm: adjusted profile::wmcs::kubeadm::etcd_latency_ms from 30 back to the default (10)
* 16:04 bstorm: cleared error state from several exec node queues
* 14:49 andrewbogott: swapping in three new etcd nodes with local storage: tools-k8s-etcd-13,14,15
 
=== 2021-05-24 ===
* 10:36 arturo: rebased labs/private.git after merge conflict
* 06:49 majavah: remove scfc kubernetes admin access after bd808 removed tools.admin membership to avoid maintain-kubeusers crashes when it expires
 
=== 2021-05-22 ===
* 14:47 majavah: manually remove jeh admin certificates and from maintain-kubeusers configmap [[phab:T282725|T282725]]
* 14:32 majavah: manually remove valhallasw yuvipanda admin certificates and from configmap and restart maintain-kubeusers pod [[phab:T282725|T282725]]
* 02:51 bd808: Restarted nginx on tools-static-14 to see if that clears up the fontcdn 502 errors
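
Pruning a stale admin certificate and bouncing maintain-kubeusers (the 14:47/14:32 entries) amounts to a configmap edit plus a pod delete; a sketch with assumed resource names and labels:
<syntaxhighlight lang="bash">
# Sketch: namespace, configmap name and pod label are assumptions, not taken from this log.
kubectl -n maintain-kubeusers get pods
kubectl -n maintain-kubeusers edit configmap maintain-kubeusers      # drop the stale admin entry
kubectl -n maintain-kubeusers delete pod -l app=maintain-kubeusers   # a fresh pod re-reads the state
</syntaxhighlight>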
 
=== 2021-05-21 ===
* 17:06 majavah: unpool tools-k8s-ingress-[4-6]
* 17:06 majavah: repool tools-k8s-ingress-6
* 17:02 majavah: repool tools-k8s-ingress-4 and -5
* 16:59 bstorm: upgrading the ingress-gen2 controllers to release 3 to capture new RAM/CPU limits
* 16:43 bstorm: resize tools-k8s-ingress-4 to g3.cores4.ram8.disk20
* 16:43 bstorm: resize tools-k8s-ingress-6 to g3.cores4.ram8.disk20
* 16:40 bstorm: resize tools-k8s-ingress-5 to g3.cores4.ram8.disk20
* 16:04 majavah: rollback kubernetes ingress update from front proxy
* 06:52 Majavah: pool tools-k8s-ingress-6 and depool ingress-[2,3] [[phab:T264221|T264221]]
 
=== 2021-05-20 ===
* 17:05 Majavah: pool tools-k8s-ingress-5 as an ingress node, depool ingress-1 [[phab:T264221|T264221]]
* 16:31 Majavah: pool tools-k8s-worker-4 as an ingress node [[phab:T264221|T264221]]
* 15:17 Majavah: trying to install ingress-nginx via helm again after adjusting security groups [[phab:T264221|T264221]]
* 15:15 Majavah: move tools-k8s-ingress-[5-6] from "tools-k8s-full-connectivity" to "tools-new-k8s-full-connectivity" security group [[phab:T264221|T264221]]
 
=== 2021-05-19 ===
* 12:15 Majavah: rollback ingress-nginx-gen2
* 11:09 Majavah: deploy helm-based nginx ingress controller v0.46.0 to ingress-nginx-gen2 namespace [[phab:T264221|T264221]]
* 10:44 Majavah: create tools-k8s-ingress-[4-6] [[phab:T264221|T264221]]
 
=== 2021-05-16 ===
* 16:52 Majavah: clear error state from tools-sgeexec-0905 tools-sgeexec-0907 tools-sgeexec-0936 tools-sgeexec-0941
 
=== 2021-05-14 ===
* 19:18 bstorm: adjusting the rate limits for bastions nfs_write upward a lot to make NFS writes faster now that the cluster is finally using 10Gb on the backend and frontend [[phab:T218338|T218338]]
* 16:55 andrewbogott: rebooting toolserver-proxy-01 to clear up stray files
* 16:47 andrewbogott: deleting log files older than 14 days on toolserver-proxy-01
 
=== 2021-05-12 ===
* 19:45 bstorm: cleared error state from some queues
* 19:05 Majavah: remove phamhi-binding phamhi-view-binding cluster role bindings [[phab:T282725|T282725]]
* 19:04 bstorm: deleted the maintain-kubeusers pod to get it up and running fast [[phab:T282725|T282725]]
* 19:03 bstorm: deleted phamhi from admin configmap in maintain-kubeusers [[phab:T282725|T282725]]
 
=== 2021-05-11 ===
* 17:17 Majavah: shutdown and delete tools-checker-03 [[phab:T278540|T278540]]
* 17:14 Majavah: move floating ip 185.15.56.61 to tools-checker-04
* 17:12 Majavah: add tools-checker-04 as a grid submit host [[phab:T278540|T278540]]
* 16:58 Majavah: add tools-checker-04 to toollabs::checker_hosts hiera key [[phab:T278540|T278540]]
* 16:49 Majavah: creating tools-checker-04 with buster [[phab:T278540|T278540]]
* 16:32 Majavah: carefully shutdown tools-k8s-haproxy-1 [[phab:T252239|T252239]]
* 16:29 Majavah: carefully shutdown tools-k8s-haproxy-2 [[phab:T252239|T252239]]
 
=== 2021-05-10 ===
* 22:58 bstorm: cleared error state on a grid queue
* 22:58 bstorm: setting `profile::wmcs::kubeadm::docker_vol: false` on ingress nodes
* 15:22 Majavah: change k8s.svc.tools.eqiad1.wikimedia.cloud. to point to the tools-k8s-haproxy-keepalived-vip address 172.16.6.113 ([[phab:T252239|T252239]])
* 15:06 Majavah: carefully rolling out keepalived to tools-k8s-haproxy-[3-4] while making sure [1-2] do not have changes
* 15:03 Majavah: clear all error states caused by overloaded exec nodes
* 14:57 arturo: allow tools-k8s-haproxy-[3-4] to use the tools-k8s-haproxy-keepalived-vip address (172.16.6.113) ([[phab:T252239|T252239]])
* 12:53 Majavah: creating tools-k8s-haproxy-[3-4] to rebuild current ones without nfs and with keepalived
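
With keepalived in place, the VIP (172.16.6.113, from the 15:22 and 14:57 entries) lives on exactly one haproxy node at a time; a quick way to check which node currently holds it (interface name is an assumption):
<syntaxhighlight lang="bash">
# Sketch: check whether this haproxy node holds the keepalived VIP.
ip -4 addr show | grep -F 172.16.6.113 && echo "this node holds the VIP"
# keepalived logs its MASTER/BACKUP transitions:
sudo journalctl -u keepalived --since '1 hour ago'
</syntaxhighlight>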
 
=== 2021-05-09 ===
* 06:55 Majavah: clear error state from tools-sgeexec-0916
 
=== 2021-05-08 ===
* 10:57 Majavah: import docker image k8s.gcr.io/ingress-nginx/controller:v0.46.0 to local registry as docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0 [[phab:T264221|T264221]]
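
Copying an upstream image into the local registry, as in this entry, is a pull/tag/push round-trip:
<syntaxhighlight lang="bash">
# Sketch: mirror an upstream image into the local Toolforge registry.
docker pull k8s.gcr.io/ingress-nginx/controller:v0.46.0
docker tag  k8s.gcr.io/ingress-nginx/controller:v0.46.0 \
            docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0
docker push docker-registry.tools.wmflabs.org/nginx-ingress-controller:v0.46.0
</syntaxhighlight>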
 
=== 2021-05-07 ===
* 18:07 Majavah: generate and add k8s haproxy keepalived password (profile::toolforge::k8s::haproxy::keepalived_password) to private puppet repo
* 17:15 bstorm: recreated recordset of k8s.tools.eqiad1.wikimedia.cloud as CNAME to k8s.svc.tools.eqiad1.wikimedia.cloud [[phab:T282227|T282227]]
* 17:12 bstorm: created A record of k8s.svc.tools.eqiad1.wikimedia.cloud pointing at current cluster with TTL of 300 for quick initial failover when the new set of haproxy nodes are ready [[phab:T282227|T282227]]
* 09:44 arturo: `sudo wmcs-openstack --os-project-id=tools port create --network lan-flat-cloudinstances2b tools-k8s-haproxy-keepalived-vip`
 
=== 2021-05-06 ===
* 14:43 Majavah: clear error states from all currently erroring exec nodes
* 14:37 Majavah: clear error state from tools-sgeexec-0913
* 04:35 Majavah: add own root key to project hiera on horizon [[phab:T278390|T278390]]
* 02:36 andrewbogott: removing jhedden from sudo roots
 
=== 2021-05-05 ===
* 19:27 andrewbogott: adding taavi as a sudo root to project toolforge for [[phab:T278390|T278390]]
 
=== 2021-05-04 ===
* 15:23 arturo: upgrading exim4-daemon-heavy in tools-mail-03
* 10:47 arturo: rebase & resolve merge conflicts in labs/private.git
 
=== 2021-05-03 ===
* 16:24 dcaro: started tools-sgeexec-0907, was stuck on initramfs due to an unclean fs (/dev/vda3, root), ran fsck manually fixing all the errors and booted up correctly after ([[phab:T280641|T280641]])
* 14:07 dcaro: depooling tools-sgeexec-0908/7 to be able to restart the VMs as they got stuck during migration ([[phab:T280641|T280641]])
 
=== 2021-04-29 ===
* 18:23 bstorm: removing one more etcd node via cookbook [[phab:T279723|T279723]]
* 18:12 bstorm: removing an etcd node via cookbook [[phab:T279723|T279723]]
 
=== 2021-04-27 ===
* 16:40 bstorm: deleted all the errored out grid jobs stuck in queue wait
* 16:16 bstorm: cleared E status on grid queues to get things flowing again
 
=== 2021-04-26 ===
* 12:17 arturo: allowing more tools into the legacy redirector ([[phab:T281003|T281003]])
 
=== 2021-04-22 ===
* 08:44 Krenair: Removed yuvipanda from roots sudo policy
* 08:42 Krenair: Removed yuvipanda from projectadmin per request
* 08:40 Krenair: Removed yuvipanda from tools.admin per request
 
=== 2021-04-20 ===
* 22:20 bd808: `clush -w @all -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 22:19 bd808: `clush -w @exec -b "sudo exiqgrep -z -i {{!}} xargs sudo exim -Mt"`
* 21:52 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad1.wikimedia.cloud`. Was using wrong domain name in prior update.
* 21:49 bstorm: tagged the latest maintain-kubeusers and deployed to toolforge (with kustomize changes to rbac) after testing in toolsbeta [[phab:T280300|T280300]]
* 21:27 bd808: Update hiera `profile::toolforge::active_mail_relay: tools-mail-03.tools.eqiad.wmflabs`. was -2 which is decommed.
* 10:18 dcaro: setting the retention on the tools-prometheus VMs to 250GB (they have 276GB total, leaving some space for online data operations if needed) ([[phab:T279990|T279990]])
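
The two clush one-liners above (22:20/22:19) thaw frozen exim messages fleet-wide: exiqgrep -z -i prints the IDs of frozen queue entries, and exim -Mt thaws them. The single-host version, plus a queue summary:
<syntaxhighlight lang="bash">
# Sketch: inspect and thaw frozen exim messages on one host.
sudo exim -bp | exiqsumm                                      # summarise the mail queue
sudo exiqgrep -z -i                                           # message IDs of frozen entries
sudo exiqgrep -z -i | xargs --no-run-if-empty sudo exim -Mt   # thaw them
</syntaxhighlight>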
 
=== 2021-04-19 ===
* 10:53 dcaro: reverting setting prometheus data source in grafana to 'server'; it can't connect
* 10:51 dcaro: setting prometheus data source in grafana to 'server' to avoid CORS issues
 
=== 2021-04-16 ===
* 23:15 bstorm: cleaned up all source files for the grid with the old domain name to enable future node creation [[phab:T277653|T277653]]
* 14:38 dcaro: added 'will get out of space in X days' panel to the dashboard https://grafana-labs.wikimedia.org/goto/kBlGd0uGk ([[phab:T279990|T279990]]), we got <5 days xd
* 11:35 arturo: running `grid-configurator --all-domains` which basically added tools-sgebastion-10,11 as submit hosts and removed tools-sgegrid-master,shadow as submit hosts
 
=== 2021-04-15 ===
* 17:45 bstorm: cleared error state from tools-sgeexec-0920.tools.eqiad.wmflabs for a failed job
 
=== 2021-04-13 ===
* 13:26 dcaro: upgrade puppet and python-wmflib on tools-prometheus-03
* 11:23 arturo: deleted shutoff VM tools-package-builder-02 ([[phab:T275864|T275864]])
* 11:21 arturo: deleted shutoff VM tools-sge-services-03,04 ([[phab:T278354|T278354]])
* 11:20 arturo: deleted shutoff VM tools-docker-registry-03,04 ([[phab:T278303|T278303]])
* 11:18 arturo: deleted shutoff VM tools-mail-02 ([[phab:T278538|T278538]])
* 11:17 arturo: deleted shutoff VMs tools-static-12,13 ([[phab:T278539|T278539]])
 
=== 2021-04-11 ===
* 16:07 bstorm: cleared E state from tools-sgeexec-0917 tools-sgeexec-0933 tools-sgeexec-0934 tools-sgeexec-0937 from failures of jobs 761759, 815031, 815056, 855676, 898936
 
=== 2021-04-08 ===
* 18:25 bstorm: cleaned up the deprecated entries in /data/project/.system_sge/gridengine/etc/submithosts for tools-sgegrid-master and tools-sgegrid-shadow using the old fqdns [[phab:T277653|T277653]]
* 09:24 arturo: allocate & associate floating IP 185.15.56.122 for tools-sgebastion-11, also with DNS A record `dev-buster.toolforge.org` ([[phab:T275865|T275865]])
* 09:22 arturo: create DNS A record `login-buster.toolforge.org` pointing to 185.15.56.66 (tools-sgebastion-10) ([[phab:T275865|T275865]])
* 09:20 arturo: associate floating IP 185.15.56.66 to tools-sgebastion-10 ([[phab:T275865|T275865]])
* 09:13 arturo: created tools-sgebastion-11 (buster) ([[phab:T275865|T275865]])
 
=== 2021-04-07 ===
* 04:35 andrewbogott: replacing the mx record '10 mail.tools.wmcloud.org' with '10 mail.tools.wmcloud.org.' — trying to fix axfr for the tools.wmcloud.org zone
 
=== 2021-04-06 ===
* 15:16 bstorm: cleared queue state since a few had "errored" for failed jobs.
* 12:59 dcaro: Removing etcd member tools-k8s-etcd-7.tools.eqiad1.wikimedia.cloud to get an odd number  ([[phab:T267082|T267082]])
* 11:45 arturo: upgrading jobutils & misctools to 1.42 everywhere
* 11:39 arturo: cleaning up aptly: old package versions, old repos (jessie, trusty, precise) etc
* 10:31 dcaro: Removing etcd member tools-k8s-etcd-6.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 10:21 arturo: published jobutils & misctools 1.42 ([[phab:T278748|T278748]])
* 10:21 arturo: aptly repo had some weirdness due to the cinder volume: hardlinks created by aptly were broken, solved with `sudo aptly publish --skip-signing repo stretch-tools -force-overwrite`
* 10:07 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
* 10:05 arturo: installed aptly from buster-backports on tools-services-05 to see if that makes any difference with an issue when publishing repos
* 09:53 dcaro: Removing etcd member tools-k8s-etcd-4.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 08:55 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
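
The add/remove pairs above replace etcd members one at a time so the cluster never loses quorum. The cookbook's internals are not logged; the corresponding etcdctl view is a sketch like this (endpoint, TLS handling and member names are assumptions):
<syntaxhighlight lang="bash">
# Sketch: one member-replacement step as seen from etcdctl.
export ETCDCTL_API=3
EP=https://tools-k8s-etcd-8.tools.eqiad1.wikimedia.cloud:2379   # illustrative endpoint
etcdctl --endpoints="$EP" member list
etcdctl --endpoints="$EP" member add tools-k8s-etcd-16 \
    --peer-urls="https://tools-k8s-etcd-16.tools.eqiad1.wikimedia.cloud:2380"
etcdctl --endpoints="$EP" member remove 8e9e05c52164694d   # id from 'member list'; only after the new member is healthy
</syntaxhighlight>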
 
=== 2021-04-05 ===
* 17:02 bstorm: chowned the data volume for the docker registry to docker-registry:docker-registry
* 09:56 arturo: make jhernandez (IRC joakino) projectadmin ([[phab:T278975|T278975]])
 
=== 2021-04-01 ===
* 20:43 bstorm: cleared error state from the grid queues caused by unspecified job errors
* 15:53 dcaro: Removed etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs, adding a new member  ([[phab:T267082|T267082]])
* 15:43 dcaro: Removing etcd member tools-k8s-etcd-5.tools.eqiad.wmflabs  ([[phab:T267082|T267082]])
* 15:36 dcaro: Added new etcd member tools-k8s-etcd-9.tools.eqiad1.wikimedia.cloud ([[phab:T267082|T267082]])
* 15:18 dcaro: adding new etcd member using the cookbook wmcs.toolforge.add_etcd_node ([[phab:T267082|T267082]])
 
=== 2021-03-31 ===
* 15:57 arturo: rebooting `tools-mail-03` after enabling NFS ([[phab:T267082|T267082]], [[phab:T278538|T278538]])
* 15:04 arturo: created MX record for `tools.wmcloud.org` pointing to `mail.tools.wmcloud.org`
* 15:03 arturo: created DNS A record `mail.tools.wmcloud.org` pointing to 185.15.56.63
* 14:56 arturo: shutoff tools-mail-02 ([[phab:T278538|T278538]])
* 14:55 arturo: point floating IP 185.15.56.63 to tools-mail-03 ([[phab:T278538|T278538]])
* 14:45 arturo: created VM `tools-mail-03` as Debian Buster ([[phab:T278538|T278538]])
* 14:39 arturo: relocate some of the hiera keys for email server from project-level to prefix
* 09:44 dcaro: running disk performance test on etcd-4 (round2)
* 09:05 dcaro: running disk performance test on etcd-8
* 08:43 dcaro: running disk performance test on etcd-4
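
The log does not name the benchmark used for these disk tests; for etcd disks the customary check is fio's fdatasync latency test, as in this sketch:
<syntaxhighlight lang="bash">
# Sketch: the standard fio fdatasync test used to judge whether a disk is fast enough for etcd
# (whether this exact benchmark was used on etcd-4/etcd-8 is not recorded here).
DIR=/mnt/etcd-bench   # a scratch directory on the disk under test
mkdir -p "$DIR"
fio --name=etcd-bench --directory="$DIR" --rw=write \
    --ioengine=sync --fdatasync=1 --size=22m --bs=2300
# etcd guidance: the 99th percentile of fdatasync latency should stay under ~10ms.
</syntaxhighlight>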
 
=== 2021-03-30 ===
* 16:15 bstorm: added `labstore::traffic_shaping::egress: 800mbps` to tools-static prefix [[phab:T278539|T278539]]
* 15:44 arturo: shutoff tools-static-12/13 ([[phab:T278539|T278539]])
* 15:41 arturo: point horizon web proxy `tools-static.wmflabs.org` to tools-static-14  ([[phab:T278539|T278539]])
* 15:37 arturo: add `mount_nfs: true` to tools-static prefix ([[phab:T278539|T278539]])
* 15:26 arturo: create VM tools-static-14 with Debian Buster image ([[phab:T278539|T278539]])
* 12:19 arturo: introduce horizon proxy `deb-tools.wmcloud.org` ([[phab:T278436|T278436]])
* 12:15 arturo: shutdown tools-sgebastion-09 (stretch)
* 11:05 arturo: created VM `tools-sgebastion-10` as Debian Buster ([[phab:T275865|T275865]])
* 11:04 arturo: created server group `tools-bastion` with anti-affinity policy
 
=== 2021-03-28 ===
* 19:31 legoktm: legoktm@tools-sgebastion-08:~$ sudo qdel -f {{Gerrit|9999704}} # [[phab:T278645|T278645]]
 
=== 2021-03-27 ===
* 02:48 Reedy: qdel -f {{Gerrit|9999895}} {{Gerrit|9999799}}
 
=== 2021-03-26 ===
* 12:21 arturo: shutdown tools-package-builder-02 (stretch), we keep -03 which is buster ([[phab:T275864|T275864]])
 
=== 2021-03-25 ===
* 19:30 bstorm: forced deletion of all jobs stuck in a deleting state [[phab:T277653|T277653]]
* 17:46 arturo: rebooting tools-sgeexec-* nodes to account for new grid master ([[phab:T277653|T277653]])
* 16:20 arturo: rebuilding tools-sgegrid-master VM as debian buster ([[phab:T277653|T277653]])
* 16:18 arturo: icinga-downtime toolschecker for 2h
* 16:05 bstorm: failed over the tools grid to the shadow master [[phab:T277653|T277653]]
* 13:36 arturo: shutdown tools-sge-services-03 ([[phab:T278354|T278354]])
* 13:33 arturo: shutdown tools-sge-services-04 ([[phab:T278354|T278354]])
* 13:31 arturo: point aptly clients to `tools-services-05.tools.eqiad1.wikimedia.cloud` (hiera change) ([[phab:T278354|T278354]])
* 12:58 arturo: created VM `tools-services-05` as Debian Buster ([[phab:T278354|T278354]])
* 12:51 arturo: create cinder volume `tools-aptly-data` ([[phab:T278354|T278354]])
 
=== 2021-03-24 ===
* 12:46 arturo: shutoff the old stretch VMs `tools-docker-registry-03` and `tools-docker-registry-04` ([[phab:T278303|T278303]])
* 12:38 arturo: associate floating IP 185.15.56.67 with `tools-docker-registry-05` and refresh FQDN docker-registry.tools.wmflabs.org accordingly ([[phab:T278303|T278303]])
* 12:33 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-05` ([[phab:T278303|T278303]])
* 12:32 arturo: snapshot cinder volume `tools-docker-registry-data` into `tools-docker-registry-data-stretch-migration` ([[phab:T278303|T278303]])
* 12:32 arturo: bump cinder storage quota from 80G to 400G (without quota request task)
* 12:11 arturo: created VM `tools-docker-registry-06` as Debian Buster ([[phab:T278303|T278303]])
* 12:09 arturo: dettach cinder volume `tools-docker-registry-data` ([[phab:T278303|T278303]])
* 11:46 arturo: attach cinder volume `tools-docker-registry-data` to VM `tools-docker-registry-03` to format it and pre-populate it with registry data ([[phab:T278303|T278303]])
* 11:20 arturo: created 80G cinder volume tools-docker-registry-data ([[phab:T278303|T278303]])
* 11:10 arturo: starting VM tools-docker-registry-04 which was stopped probably since 2021-03-09 due to hypervisor draining
 
=== 2021-03-23 ===
* 12:46 arturo: aborrero@tools-sgegrid-master:~$ sudo systemctl restart gridengine-master.service
* 12:16 arturo: delete & re-create VM tools-sgegrid-shadow as Debian Buster ([[phab:T277653|T277653]])
* 12:14 arturo: created puppet prefix 'tools-sgegrid-shadow' and migrated puppet configuration from VM-puppet
* 12:13 arturo: created server group 'tools-grid-master-shadow' with anti-affinity policy
 
=== 2021-03-18 ===
* 19:24 bstorm: set profile::toolforge::infrastructure across the entire project with login_server set on the bastion and exec node-related prefixes
* 16:21 andrewbogott: enabling puppet tools-wide
* 16:20 andrewbogott: disabling puppet tools-wide to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/672456
* 16:19 bstorm: added profile::toolforge::infrastructure class to puppetmaster [[phab:T277756|T277756]]
* 04:12 bstorm: rebooted tools-sgeexec-0935.tools.eqiad.wmflabs because it forgot how to LDAP...likely root cause of the issues tonight
* 03:59 bstorm: rebooting grid master. sorry for the cron spam
* 03:49 bstorm: restarting sssd on tools-sgegrid-master
* 03:37 bstorm: deleted a massive number of stuck jobs that misfired from the cron server
* 03:35 bstorm: rebooting tools-sgecron-01 to try to clear up the ldap-related errors coming out of it
* 01:46 bstorm: killed the toolschecker cron job, which had an LDAP error, and ran it again by hand
 
=== 2021-03-17 ===
* 20:57 bstorm: deployed changes to rbac for kubernetes to add kubectl top access for tools
* 20:26 andrewbogott: moving tools-elastic-3 to cloudvirt1034; two elastic nodes shouldn't be on the same hv
 
=== 2021-03-16 ===
* 16:31 arturo: installing jobutils and misctools 1.41
* 15:55 bstorm: deleted a bunch of messed up grid jobs ({{Gerrit|9989481}},8813,81682,86317,122602,122623,583621,606945,606999)
* 12:32 arturo: add packages jobutils / misctools v1.41 to <nowiki>{</nowiki>stretch,buster<nowiki>}</nowiki>-tools aptly repository in tools-sge-services-03
 
=== 2021-03-12 ===
* 23:13 bstorm: cleared error state for all grid queues
 
=== 2021-03-11 ===
* 17:40 bstorm: deployed metrics-server:0.4.1 to kubernetes
* 16:21 bstorm: add jobutils 1.40 and misctools 1.40 to stretch-tools
* 13:11 arturo: add misctools 1.37 to buster-tools{{!}}toolsbeta aptly repo for [[phab:T275865|T275865]]
* 13:10 arturo: add jobutils 1.40 to buster-tools aptly repo for [[phab:T275865|T275865]]
 
=== 2021-03-10 ===
* 10:56 arturo: briefly stopped VM tools-k8s-etcd-7 to disable VMX cpu flag
 
=== 2021-03-09 ===
* 13:31 arturo: hard-reboot tools-docker-registry-04 because issues related to [[phab:T276922|T276922]]
* 12:34 arturo: briefly rebooting VM tools-docker-registry-04, we need to reboot the hypervisor cloudvirt1038 and failed to migrate away
 
=== 2021-03-05 ===
* 12:30 arturo: started tools-redis-1004 again
* 12:22 arturo: stop tools-redis-1004 to ease draining of cloudvirt1035
 
=== 2021-03-04 ===
* 11:25 arturo: rebooted tools-sgewebgrid-generic-0901, repool it again
* 09:58 arturo: depool tools-sgewebgrid-generic-0901 to reboot VM. It was stuck in MIGRATING state when draining cloudvirt1022
 
=== 2021-03-03 ===
* 15:17 arturo: shutting down tools-sgebastion-07 in an attempt to fix nova state and finish hypervisor migration
* 15:11 arturo: tools-sgebastion-07 triggered a neutron exception (unauthorized) while being live-migrated from cloudvirt1021 to 1029. Resetting nova state with `nova reset-state bd685d48-1011-404e-a755-{{Gerrit|372f6022f345}} --active` and try again
* 14:48 arturo: killed pywikibot instance running in tools-sgebastion-07 by user msyn
 
=== 2021-03-02 ===
* 15:23 bstorm: depooling tools-sgewebgrid-lighttpd-0914.tools.eqiad.wmflabs for reboot. It isn't communicating right
* 15:22 bstorm: cleared queue error states...will need to keep a better eye on what's causing those
 
=== 2021-02-27 ===
* 02:23 bstorm: deployed typo fix to maintain-kubeusers in an innocent effort to make the weekend better [[phab:T275910|T275910]]
* 02:00 bstorm: running a script to repair the dumps mount in all podpresets [[phab:T275371|T275371]]
 
=== 2021-02-26 ===
* 22:04 bstorm: cleaned up grid jobs {{Gerrit|1230666}},{{Gerrit|1908277}},{{Gerrit|1908299}},{{Gerrit|2441500}},{{Gerrit|2441513}}
* 21:27 bstorm: hard rebooting tools-sgeexec-0947
* 21:21 bstorm: hard rebooting tools-sgeexec-0952.tools.eqiad.wmflabs
* 20:01 bd808: Deleted csr in strange state for tool-ores-inspect
 
=== 2021-02-24 ===
* 18:30 bd808: `sudo wmcs-openstack role remove --user zfilipin --project tools user` [[phab:T267313|T267313]]
* 01:04 bstorm: hard rebooting tools-k8s-worker-76 because it's in a sorry state
 
=== 2021-02-23 ===
* 23:11 bstorm: draining a bunch of k8s workers to clean up after dumps changes [[phab:T272397|T272397]]
* 23:06 bstorm: draining tools-k8s-worker-55 to clean up after dumps changes [[phab:T272397|T272397]]
 
=== 2021-02-22 ===
* 20:40 bstorm: repooled tools-sgeexec-0918.tools.eqiad.wmflabs
* 19:09 bstorm: hard rebooted tools-sgeexec-0918 from openstack [[phab:T275411|T275411]]
* 19:07 bstorm: shutting down tools-sgeexec-0918 with the VM's command line (not libvirt directly yet) [[phab:T275411|T275411]]
* 19:05 bstorm: shutting down tools-sgeexec-0918 (with openstack to see what happens) [[phab:T275411|T275411]]
* 19:03 bstorm: depooled tools-sgeexec-0918 [[phab:T275411|T275411]]
* 18:56 bstorm: deleted job {{Gerrit|1962508}} from the grid to clear it up [[phab:T275301|T275301]]
* 16:58 bstorm: cleared error state on several grid queues
 
=== 2021-02-19 ===
* 12:31 arturo: deploying new version of toolforge ingress admission controller
 
=== 2021-02-17 ===
* 21:26 bstorm: deleted tools-puppetdb-01 since it is unused at this time (and undersized anyway)
 
=== 2021-02-04 ===
* 16:27 bstorm: rebooting tools-package-builder-02
 
=== 2021-01-26 ===
* 16:27 bd808: Hard reboot of tools-sgeexec-0906 via Horizon for [[phab:T272978|T272978]]
 
=== 2021-01-22 ===
* 09:59 dcaro: added the record redis.svc.tools.eqiad1.wikimedia.cloud pointing to tools-redis1003 ([[phab:T272679|T272679]])
 
=== 2021-01-21 ===
* 23:58 bstorm: deployed new maintain-kubeusers to tools [[phab:T271847|T271847]]
 
=== 2021-01-19 ===
* 22:57 bstorm: truncated 75GB error log /data/project/robokobot/virgule.err [[phab:T272247|T272247]]
* 22:48 bstorm: truncated 100GB error log /data/project/magnus-toolserver/error.log [[phab:T272247|T272247]]
* 22:43 bstorm: truncated 107GB log '/data/project/meetbot/logs/messages.log' [[phab:T272247|T272247]]
* 22:34 bstorm: truncating 194 GB error log '/data/project/mix-n-match/mnm-microsync.err' [[phab:T272247|T272247]]
* 16:37 bd808: Added Jhernandez to root sudoers group
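A minimal sketch of the truncation pattern used in the log-cleanup entries above, assuming the usual coreutils invocation (the log records only the outcome, not the exact command):
<syntaxhighlight lang="bash">
# Zero out an oversized log in place. Unlike rm, this keeps the inode and any
# open file handles valid, so the writing tool keeps logging without a restart.
sudo truncate -s 0 /data/project/meetbot/logs/messages.log
</syntaxhighlight>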
 
=== 2021-01-14 ===
* 20:56 bstorm: setting bastions to have mostly-uncapped egress network and 40MBps nfs_read for better shared use
* 20:43 bstorm: running tc-setup across the k8s workers
* 20:40 bstorm: running tc-setup across the grid fleet
* 17:58 bstorm: hard rebooting tools-sgecron-01 following network issues during upgrade to stein [[phab:T261134|T261134]]
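`tc-setup` is a WMCS helper whose internals are not shown in this log; as a rough, hypothetical illustration of the kind of egress cap it applies, an HTB rule could look like this (device name and class layout are assumptions):
<syntaxhighlight lang="bash">
# Hypothetical sketch only; the real tc-setup script's classes and rates are
# not recorded here. 320 mbit/s is roughly the 40 MB/s nfs_read cap noted above.
DEV=ens3
sudo tc qdisc add dev "$DEV" root handle 1: htb default 10
sudo tc class add dev "$DEV" parent 1: classid 1:10 htb rate 320mbit
</syntaxhighlight>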
 
=== 2021-01-13 ===
* 10:02 arturo: delete floating IP allocation 185.15.56.245 ([[phab:T271867|T271867]])
 
=== 2021-01-12 ===
* 18:16 bstorm: deleted wedged CSR tool-adhs-wde to get maintain-kubeusers working again [[phab:T271842|T271842]]
 
=== 2021-01-05 ===
* 18:49 bstorm: changing the limits on k8s etcd nodes again, so disabling puppet on them [[phab:T267966|T267966]]
 
=== 2021-01-04 ===
* 18:21 bstorm: ran 'sudo systemctl stop getty@ttyS1.service && sudo systemctl disable getty@ttyS1.service' on tools-k8s-etcd-5. I have no idea why that keeps coming back.
 
=== 2020-12-22 ===
* 18:22 bstorm: rebooting the grid master because it is misbehaving following the NFS outage
* 10:53 arturo: rebase & resolve ugly git merge conflict in labs/private.git
 
=== 2020-12-18 ===
* 18:37 bstorm: set profile::wmcs::kubeadm::etcd_latency_ms: 15 [[phab:T267966|T267966]]
 
=== 2020-12-17 ===
* 21:42 bstorm: doing the same procedure to increase the timeouts more [[phab:T267966|T267966]]
* 19:56 bstorm: puppet enabled one at a time, letting things catch up. Timeouts are now adjusted to something closer to fsync values [[phab:T267966|T267966]]
* 19:44 bstorm: set etcd timeouts seed value to 20 instead of the default 10 (profile::wmcs::kubeadm::etcd_latency_ms) [[phab:T267966|T267966]]
* 18:58 bstorm: disabling puppet on k8s-etcd servers to alter the timeouts [[phab:T267966|T267966]]
* 14:23 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-4 ([[phab:T267966|T267966]])
* 14:21 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-5 ([[phab:T267966|T267966]])
* 14:19 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-6 ([[phab:T267966|T267966]])
* 14:17 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-7 ([[phab:T267966|T267966]])
* 14:15 arturo: regenerating puppet cert with proper alt names in tools-k8s-etcd-8 ([[phab:T267966|T267966]])
* 14:12 arturo: updated kube-apiserver manifest with new etcd nodes ([[phab:T267966|T267966]])
* 13:56 arturo: adding etcd dns_alt_names hiera keys to the puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/beb27b45a74765a64552f2d4f70a40b217b4f4e9%5E%21/
* 13:12 arturo: making k8s api server aware of the new etcd nodes via hiera update https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/3761c4c4dab1c3ed0ab0a1133d2ccf3df6c28baf%5E%21/ ([[phab:T267966|T267966]])
* 12:54 arturo: joining new etcd nodes in the k8s etcd cluster ([[phab:T267966|T267966]])
* 12:52 arturo: adding more etcd nodes in the hiera key in tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/b4f60768078eccdabdfab4cd99c7c57076de51b2
* 12:50 arturo: dropping more unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/e9e66a6787d9b91c08cf4742a27b90b3e6d05aac
* 12:49 arturo: dropping unused hiera keys in the tools-k8s-etcd puppet prefix https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/2b4cb4a41756e602fb0996e7d0210e9102172424
* 12:16 arturo: created VM `tools-k8s-etcd-8` ([[phab:T267966|T267966]])
* 12:15 arturo: created VM `tools-k8s-etcd-7` ([[phab:T267966|T267966]])
* 12:13 arturo: created `tools-k8s-etcd` anti-affinity server group
 
=== 2020-12-11 ===
* 18:29 bstorm: certificatesigningrequest.certificates.k8s.io "tool-production-error-tasks-metrics" deleted to stop maintain-kubeusers issues
* 12:14 dcaro: upgrading stable/main (clinic duty)
* 12:12 dcaro: upgrading buster-wikimedia/main (clinic duty)
* 12:03 dcaro: upgrading stable-updates/main, mainly cacertificates (clinic duty)
* 12:01 dcaro: upgrading stretch-backports/main, mainly libuv (clinic duty)
* 11:58 dcaro: disabled all the repos blocking upgrades on tools-package-builder-02 (duplicated, other releases...)
* 11:35 arturo: uncordon tools-k8s-worker-71 and tools-k8s-worker-55, they weren't uncordoned yesterday for whatever reason ([[phab:T263284|T263284]])
* 11:27 dcaro: upgrading stretch-wikimedia/main (clinic duty)
* 11:20 dcaro: upgrading stretch-wikimedia/thirdparty/mono-project-stretch (clinic duty)
* 11:08 dcaro: upgrade stretch-wikimedia/component/php72 (minor upgrades) (clinic duty)
* 11:04 dcaro: upgrade oldstable/main packages (clinic duty)
* 10:58 dcaro: upgrade kubectl done (clinic duty)
* 10:53 dcaro: upgrade kubectl (clinic duty)
* 10:16 dcaro: upgrading oldstable/main packages (clinic duty)
 
=== 2020-12-10 ===
* 17:35 bstorm: k8s-control nodes upgraded to 1.17.13 [[phab:T263284|T263284]]
* 17:16 arturo: k8s control nodes were all upgraded to 1.17, now upgrading worker nodes ([[phab:T263284|T263284]])
* 15:50 dcaro: puppet upgraded to 5.5.10 on the hosts, ping me if you see anything weird (clinic duty)
* 15:41 arturo: icinga-downtime toolschecker for 2h ([[phab:T263284|T263284]])
* 15:35 dcaro: Puppet 5 on tools-sgebastion-09 ran well and without issues, upgrading the other sge nodes (clinic duty)
* 15:32 dcaro: Upgrading puppet from 4 to 5 on tools-sgebastion-09 (clinic duty)
* 12:41 arturo: set hiera `profile::wmcs::kubeadm::component: thirdparty/kubeadm-k8s-1-17` in project & tools-k8s-control prefix ([[phab:T263284|T263284]])
* 11:50 arturo: disabled puppet in all k8s nodes in preparation for version upgrade ([[phab:T263284|T263284]])
* 09:58 dcaro: successful tesseract upgrade on tools-sgewebgrid-lighttpd-0914, upgrading the rest of nodes (clinic duty)
* 09:49 dcaro: upgrading tesseract on tools-sgewebgrid-lighttpd-0914 (clinic duty)
 
=== 2020-12-08 ===
* 19:01 bstorm: pushed updated calico node image (v3.14.0) to internal docker registry as well [[phab:T269016|T269016]]
 
=== 2020-12-07 ===
* 22:56 bstorm: pushed updated local copies of the typha, calico-cni and calico-pod2daemon-flexvol images to the tools internal registry [[phab:T269016|T269016]]
 
=== 2020-12-03 ===
* 09:18 arturo: restarted kubelet systemd service on tools-k8s-worker-38. Node was NotReady, complaining about 'use of closed network connection'
* 09:16 arturo: restarted kubelet systemd service on tools-k8s-worker-59. Node was NotReady, complaining about 'use of closed network connection'
 
=== 2020-11-28 ===
* 23:35 Krenair: Re-scheduled 4 continuous jobs from tools-sgeexec-0908 as it appears to be broken, at about 23:20 UTC
* 04:35 Krenair: Ran `sudo -i kubectl -n tool-mdbot delete cm maintain-kubeusers` on tools-k8s-control-1 for [[phab:T268904|T268904]], seems to have regenerated ~tools.mdbot/.kube/config
 
=== 2020-11-24 ===
* 17:44 arturo: rebased labs/private.git. 2 patches had merge conflicts
* 16:36 bd808: clush -w @all -b 'sudo -i apt-get purge nscd'
* 16:31 bd808: Ran `sudo -i apt-get purge nscd` on tools-sgeexec-0932 to try and fix apt state for puppet
 
=== 2020-11-10 ===
* 19:45 andrewbogott: rebooting  tools-sgeexec-0950; OOM
 
=== 2020-11-02 ===
* 13:35 arturo: (typo: dcaro)
* 13:35 arturo: added dcar as projectadmin & user ([[phab:T266068|T266068]])
 
=== 2020-10-29 ===
* 21:33 legoktm: published docker-registry.tools.wmflabs.org/toolbeta-test image ([[phab:T265681|T265681]])
* 21:10 bstorm: Added another ingress node to k8s cluster in case the load spikes are the problem [[phab:T266506|T266506]]
* 17:33 bstorm: hard rebooting tools-sgeexec-0905 and tools-sgeexec-0916 to get the grid back to full capacity
* 04:03 legoktm: published docker-registry.tools.wmflabs.org/toolforge-buster0-builder:latest image ([[phab:T265686|T265686]])
 
=== 2020-10-28 ===
* 23:42 bstorm: dramatically elevated the egress cap on tools-k8s-ingress nodes that were affected by the NFS settings [[phab:T266506|T266506]]
* 22:10 bstorm: launching tools-k8s-ingress-3 to try and get an NFS-free node [[phab:T266506|T266506]]
* 21:58 bstorm: set 'mount_nfs: false' on the tools-k8s-ingress prefix [[phab:T266506|T266506]]
 
=== 2020-10-23 ===
* 22:22 legoktm: imported pack_0.14.2-1_amd64.deb into buster-tools ([[phab:T266270|T266270]])
 
=== 2020-10-21 ===
* 17:58 legoktm: pushed toolforge-buster0-<nowiki>{</nowiki>build,run<nowiki>}</nowiki>:latest images to docker registry
 
=== 2020-10-15 ===
* 22:00 bstorm: manually removing nscd from tools-sgebastion-08 and running puppet
* 18:23 andrewbogott: uncordoning tools-k8s-worker-53, 54, 55, 59
* 17:28 andrewbogott: depooling tools-k8s-worker-53, 54, 55, 59
* 17:27 andrewbogott: uncordoning tools-k8s-worker-35, 37, 45
* 16:44 andrewbogott: depooling tools-k8s-worker-35, 37, 45
 
=== 2020-10-14 ===
* 21:00 andrewbogott: repooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:37 andrewbogott: depooling tools-sgewebgrid-generic-0901 and tools-sgewebgrid-lighttpd-0915
* 20:35 andrewbogott: repooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16
* 20:31 bd808: Deployed toollabs-webservice v0.74
* 19:53 andrewbogott: depooling tools-sgewebgrid-lighttpd-0911, 12, 13, 16 and moving to Ceph
* 19:47 andrewbogott: repooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:07 andrewbogott: depooling tools-sgeexec-0932, 33, 34 and moving to Ceph
* 19:06 andrewbogott: repooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
* 16:56 andrewbogott: depooling tools-sgeexec-0935, 36, 38, 40 and moving to Ceph
 
=== 2020-10-10 ===
* 17:07 bstorm: cleared errors on tools-sgeexec-0912.tools.eqiad.wmflabs to get the queue moving again
 
=== 2020-10-08 ===
* 17:07 bstorm: rebuilding docker images with locales-all [[phab:T263339|T263339]]
 
=== 2020-10-06 ===
* 19:04 andrewbogott: uncordoned tools-k8s-worker-38
* 18:51 andrewbogott: uncordoned tools-k8s-worker-52
* 18:40 andrewbogott: draining and cordoning tools-k8s-worker-52 and tools-k8s-worker-38 for ceph migration
 
=== 2020-10-02 ===
* 21:09 bstorm: rebooting tools-k8s-worker-70 because it seems to be unable to recover from an old NFS disconnect
* 17:37 andrewbogott: stopping tools-prometheus-03 to attempt a snapshot
* 16:03 bstorm: shutting down tools-prometheus-04 to try to fsck the disk
 
=== 2020-10-01 ===
* 21:39 andrewbogott: migrating tools-proxy-06 to ceph
* 21:35 andrewbogott: moving  k8s.tools.eqiad1.wikimedia.cloud from 172.16.0.99 (toolsbeta-test-k8s-haproxy-1) to 172.16.0.108 (toolsbeta-test-k8s-haproxy-2) in anticipation of downtime for haproxy-1 tomorrow
 
=== 2020-09-30 ===
* 18:34 andrewbogott: repooling tools-sgeexec-0918
* 18:29 andrewbogott: depooling tools-sgeexec-0918 so I can reboot cloudvirt1036
 
=== 2020-09-23 ===
* 21:38 bstorm: ran an 'apt clean' across the fleet to get ahead of the new locale install
 
=== 2020-09-18 ===
* 19:41 andrewbogott: repooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:04 andrewbogott: depooling tools-k8s-worker-30, 33, 34, 57, 60
* 19:02 andrewbogott: repooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:48 andrewbogott: depooling tools-k8s-worker-41, 43, 44, 47, 48, 49, 50, 51
* 17:47 andrewbogott: repooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:40 andrewbogott: depooling tools-k8s-worker-31, 32, 36, 39, 40
* 16:38 andrewbogott: repooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 16:10 andrewbogott: depooling tools-sgewebgrid-lighttpd-0914, tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0919, tools-sgewebgrid-lighttpd-0918
* 13:54 andrewbogott: repooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916
* 13:50 andrewbogott: depooling tools-sgeexec-0913, tools-sgeexec-0915, tools-sgeexec-0916  for flavor update
* 01:20 andrewbogott: repooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  after flavor update
* 01:11 andrewbogott: depooling tools-sgeexec-0901, tools-sgeexec-0905, tools-sgeexec-0910, tools-sgeexec-0911, tools-sgeexec-0912  for flavor update
* 01:08 andrewbogott: repooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  after flavor update
* 01:00 andrewbogott: depooling tools-sgeexec-0917, tools-sgeexec-0918, tools-sgeexec-0919, tools-sgeexec-0920  for flavor update
* 00:58 andrewbogott: repooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 after flavor update
* 00:49 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
 
=== 2020-09-17 ===
* 21:56 bd808: Built and deployed tools-manifest v0.22 ([[phab:T263190|T263190]])
* 21:55 bd808: Built and deployed tools-manifest v0.22 ([[phab:T169695|T169695]])
* 20:34 bd808: Live hacked "--backend=gridengine" into webservicemonitor on tools-sgecron-01 ([[phab:T263190|T263190]])
* 20:21 bd808: Restarted webservicemonitor on tools-sgecron-01.tools.eqiad.wmflabs
* 20:09 andrewbogott: I didn't actually depool tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 because there was some kind of brief outage just now
* 19:58 andrewbogott: depooling tools-sgeexec-0942, tools-sgeexec-0947, tools-sgeexec-0950, tools-sgeexec-0951, tools-sgeexec-0952 for flavor update
* 19:55 andrewbogott: repooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 19:29 andrewbogott: depooling tools-k8s-worker-61,62,64,65,67,68,69 for flavor update
* 15:38 andrewbogott: repooling tools-k8s-worker-70 and tools-k8s-worker-66 after flavor remapping
* 15:34 andrewbogott: depooling tools-k8s-worker-70 and tools-k8s-worker-66 for flavor remapping
* 15:30 andrewbogott: repooling tools-sgeexec-0909, 0908, 0907, 0906, 0904
* 15:21 andrewbogott: depooling tools-sgeexec-0909, 0908, 0907, 0906, 0904 for flavor remapping
* 13:55 andrewbogott: depooled tools-sgewebgrid-lighttpd-0917 and tools-sgewebgrid-lighttpd-0920
* 13:55 andrewbogott: repooled tools-sgeexec-0937 after move to ceph
* 13:45 andrewbogott: depooled tools-sgeexec-0937 for move to ceph
 
=== 2020-09-16 ===
* 23:20 andrewbogott: repooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:03 andrewbogott: depooled tools-sgeexec-0941 and tools-sgeexec-0939 for move to ceph
* 23:02 andrewbogott: uncordoned tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 22:29 andrewbogott: draining tools-k8s-worker-58, tools-k8s-worker-56, tools-k8s-worker-42 for migration to ceph
* 17:37 andrewbogott: service gridengine-master restart on tools-sgegrid-master
 
=== 2020-09-10 ===
* 15:37 arturo: hard-rebooting tools-proxy-05
* 15:33 arturo: rebooting tools-proxy-05 to try flushing local DNS caches
* 15:25 arturo: detected missing DNS record for k8s.tools.eqiad1.wikimedia.cloud which means the k8s cluster is down
* 10:22 arturo: enabling ingress dedicated worker nodes in the k8s cluster ([[phab:T250172|T250172]])
 
=== 2020-09-09 ===
* 11:12 arturo: new ingress nodes added to the cluster, and tainted/labeled per the docs https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Deploying#ingress_nodes ([[phab:T250172|T250172]])
* 10:50 arturo: created puppet prefix `tools-k8s-ingress` ([[phab:T250172|T250172]])
* 10:42 arturo: created VMs tools-k8s-ingress-1 and tools-k8s-ingress-2 in the `tools-ingress` server group ([[phab:T250172|T250172]])
* 10:38 arturo: created server group `tools-ingress` with soft anti affinity policy ([[phab:T250172|T250172]])
 
=== 2020-09-08 ===
* 23:24 bstorm: clearing grid queue error states blocking job runs
* 22:53 bd808: forcing puppet run on tools-sgebastion-07
 
=== 2020-09-02 ===
* 18:13 andrewbogott: moving tools-sgeexec-0920  to ceph
* 17:57 andrewbogott: moving tools-sgeexec-0942  to ceph
 
=== 2020-08-31 ===
* 19:58 andrewbogott: migrating tools-sgeexec-091[0-9] to ceph
* 17:19 andrewbogott: migrating tools-sgeexec-090[4-9] to ceph
* 17:19 andrewbogott: repooled tools-sgeexec-0901
* 16:52 bstorm: `apt install uwsgi` was run on tools-checker-03 in the last log [[phab:T261677|T261677]]
* 16:51 bstorm: running `apt install uwsgi` with --allow-downgrades to fix the puppet setup there [[phab:T261677|T261677]]
* 14:26 andrewbogott: depooling tools-sgeexec-0901, migrating to ceph
 
=== 2020-08-30 ===
* 00:57 Krenair: also ran qconf -ds on each
* 00:35 Krenair: Tidied up SGE problems (it was spamming root@ every minute for hours) following host deletions some hours ago - removed tools-sgeexec-0921 through 0931 from @general, ran qmod -rj on all jobs registered for those nodes, then qdel -f on the remainders, then qconf -de on each deleted node
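Taken together, the two entries above amount to a grid-node decommission sequence. A sketch of it for a single host, assuming standard gridengine tooling (the job-listing filter is an assumption; the log names only the qconf/qmod/qdel steps):
<syntaxhighlight lang="bash">
# Repeat per deleted node (tools-sgeexec-0921 through 0931 in this case).
host=tools-sgeexec-0921.tools.eqiad.wmflabs
# Drop the node from the @general hostgroup.
qconf -dattr hostgroup hostlist "$host" @general
# Reschedule jobs still registered there; force-delete any that remain.
for job in $(qstat -u '*' -q "*@${host}" | awk 'NR>2 {print $1}'); do
    qmod -rj "$job" || qdel -f "$job"
done
# Remove the execution-daemon and submit-host entries.
qconf -de "$host"
qconf -ds "$host"
</syntaxhighlight>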
 
=== 2020-08-29 ===
* 16:02 bstorm: deleting "tools-sgeexec-0931", "tools-sgeexec-0930", "tools-sgeexec-0929", "tools-sgeexec-0928", "tools-sgeexec-0927"
* 16:00 bstorm: deleting  "tools-sgeexec-0926", "tools-sgeexec-0925", "tools-sgeexec-0924", "tools-sgeexec-0923", "tools-sgeexec-0922", "tools-sgeexec-0921"
 
=== 2020-08-26 ===
* 21:08 bd808: Disabled puppet on tools-proxy-06 to test fixes for a bug in the new [[phab:T251628|T251628]] code
* 08:54 arturo: merged several patches by bryan for toolforge front proxy (cleanups, etc) example: https://gerrit.wikimedia.org/r/c/operations/puppet/+/622435
 
=== 2020-08-25 ===
* 19:38 andrewbogott: deleting tools-sgeexec-0943.tools.eqiad.wmflabs, tools-sgeexec-0944.tools.eqiad.wmflabs, tools-sgeexec-0945.tools.eqiad.wmflabs, tools-sgeexec-0946.tools.eqiad.wmflabs, tools-sgeexec-0948.tools.eqiad.wmflabs, tools-sgeexec-0949.tools.eqiad.wmflabs, tools-sgeexec-0953.tools.eqiad.wmflabs — they are broken and we're not very curious why; will retry this exercise when everything is standardized on
* 15:03 andrewbogott: removing non-ceph nodes tools-sgeexec-0921 through tools-sgeexec-0931
* 15:02 andrewbogott: added new sge-exec nodes tools-sgeexec-0943 through tools-sgeexec-0953 (for real this time)
 
=== 2020-08-19 ===
* 21:29 andrewbogott: shutting down and removing  tools-k8s-worker-20 through tools-k8s-worker-29; this load can now be handled by new nodes on ceph hosts
* 21:15 andrewbogott: shutting down and removing  tools-k8s-worker-1 through tools-k8s-worker-19; this load can now be handled by new nodes on ceph hosts
* 18:40 andrewbogott: creating 13 new xlarge k8s worker nodes, tools-k8s-worker-67 through tools-k8s-worker-79
 
=== 2020-08-18 ===
* 15:24 bd808: Rebuilding all Docker containers to pick up newest versions of installed packages
 
=== 2020-07-30 ===
* 16:28 andrewbogott: added new xlarge ceph-hosted worker nodes: tools-k8s-worker-61, 62, 63, 64, 65, 66.  [[phab:T258663|T258663]]
 
=== 2020-07-29 ===
* 23:24 bd808: Pushed a copy of docker-registry.wikimedia.org/wikimedia-jessie:latest to docker-registry.tools.wmflabs.org/wikimedia-jessie:latest in preparation for the upstream image going away
 
=== 2020-07-24 ===
* 22:33 bd808: Removed a few more ancient docker images: grrrit, jessie-toollabs, and nagf
* 21:02 bd808: Running cleanup script to delete the non-sssd toolforge images from docker-registry.tools.wmflabs.org
* 20:17 bd808: Forced garbage collection on docker-registry.tools.wmflabs.org
* 20:06 bd808: Running cleanup script to delete all of the old toollabs-* images from docker-registry.tools.wmflabs.org
 
=== 2020-07-22 ===
* 23:24 bstorm: created server group 'tools-k8s-worker' to create any new worker nodes in so that they have a low chance of being scheduled together by openstack unless it is necessary [[phab:T258663|T258663]]
* 23:22 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[56-60] [[phab:T257945|T257945]]
* 23:17 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[41-55] [[phab:T257945|T257945]]
* 23:14 bstorm: running puppet and NFS 4.2 remount on tools-k8s-worker-[21-40] [[phab:T257945|T257945]]
* 23:11 bstorm: running puppet and NFS remount on tools-k8s-worker-[1-15] [[phab:T257945|T257945]]
* 23:07 bstorm: disabling puppet on k8s workers to reduce the effect of changing the NFS mount version all at once [[phab:T257945|T257945]]
* 22:28 bstorm: setting tools-k8s-control prefix to mount NFS v4.2 [[phab:T257945|T257945]]
* 22:15 bstorm: set the tools-k8s-control nodes to also use 800MBps to prevent issues with toolforge ingress and api system
* 22:07 bstorm: set tools-k8s-haproxy-1 (the main load balancer for toolforge) to an egress limit of 800MB per sec, rather than the same limit as all the other servers
 
=== 2020-07-21 ===
* 16:09 bstorm: rebooting tools-sgegrid-shadow to remount NFS correctly
* 15:55 bstorm: set the bastion prefix to have explicitly set hiera value of profile::wmcs::nfsclient::nfs_version: '4'
 
=== 2020-07-17 ===
* 16:47 bd808: Enabled Puppet on tools-proxy-06 following successful test ([[phab:T102367|T102367]])
* 16:29 bd808: Disabled Puppet on tools-proxy-06 to test nginx config changes manually ([[phab:T102367|T102367]])
 
=== 2020-07-15 ===
* 23:11 bd808: Removed ssh root key for valhallasw from project hiera ([[phab:T255697|T255697]])
 
=== 2020-07-09 ===
* 18:53 bd808: Updating git-review to 1.27 via clush across cluster ([[phab:T257496|T257496]])
 
=== 2020-07-08 ===
* 11:16 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 -- important change to front-proxy ([[phab:T234617|T234617]])
* 11:11 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/610029 ([[phab:T234617|T234617]])
 
=== 2020-07-07 ===
* 23:22 bd808: Rebuilding all Docker images to pick up webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:19 bd808: Deploying webservice v0.73 via clush ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 23:16 bd808: Building webservice v0.73 ([[phab:T234617|T234617]], [[phab:T257229|T257229]])
* 15:01 Reedy: killed python process from tools.experimental-embeddings using a lot of cpu on tools-sgebastion-07
* 15:01 Reedy: killed meno25 process running pwb.py on tools-sgebastion-07
* 09:59 arturo: point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) ([[phab:T247236|T247236]])
 
=== 2020-07-06 ===
* 11:54 arturo: briefly point DNS tools.wmflabs.org A record to 185.15.56.60 (tools-legacy-redirector) and then switch back to 185.15.56.11 (tools-proxy-05). The legacy redirector does HTTP/307 ([[phab:T247236|T247236]])
* 11:50 arturo: associate floating IP address 185.15.56.60 to tools-legacy-redirector ([[phab:T247236|T247236]])
 
=== 2020-07-01 ===
* 11:19 arturo: cleanup exim email queue (4 frozen messages)
* 11:02 arturo: live-hacking puppetmaster with https://gerrit.wikimedia.org/r/c/operations/puppet/+/608849 ([[phab:T256737|T256737]])
 
=== 2020-06-30 ===
* 11:18 arturo: set some hiera keys for mtail in puppet prefix `tools-mail` ([[phab:T256737|T256737]])
 
=== 2020-06-29 ===
* 22:48 legoktm: built html-sssd/web image ([[phab:T241817|T241817]])
* 22:23 legoktm: rebuild python<nowiki>{</nowiki>34,35,37<nowiki>}</nowiki>-sssd/web images for https://gerrit.wikimedia.org/r/608093
* 12:01 arturo: introduced spam filter in the mail server ([[phab:T120210|T120210]])
 
=== 2020-06-25 ===
* 21:49 zhuyifei1999_: re-enabling puppet on tools-sgebastion-09 [[phab:T256426|T256426]]
* 21:39 zhuyifei1999_: disabling puppet on tools-sgebastion-09 so I can play with mount settings [[phab:T256426|T256426]]
* 21:24 bstorm: hard rebooting tools-sgebastion-09
 
=== 2020-06-24 ===
* 12:36 arturo: live-hacking puppetmaster with exim prometheus stuff ([[phab:T175964|T175964]])
* 11:57 arturo: merging email ratelimiting patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/607320 ([[phab:T175964|T175964]])
 
=== 2020-06-23 ===
* 17:55 arturo: killed procs for users `hamishz` and `msyn` which apparently were tools that should be running in the grid / kubernetes instead
* 16:08 arturo: created acme-chief cert `tools_mail` in the prefix hiera
 
=== 2020-06-17 ===
* 10:40 arturo: created VM tools-legacy-redirector, with the corresponding puppet prefix ([[phab:T247236|T247236]], [[phab:T234617|T234617]])
 
=== 2020-06-16 ===
* 23:01 bd808: Building new Docker images to pick up webservice 0.72
* 22:58 bd808: Deploying webservice 0.72 to bastions and grid
* 22:56 bd808: Building webservice 0.72
* 15:10 arturo: merging a patch with changes to the template for keepalived (used in the elastic cluster) https://gerrit.wikimedia.org/r/c/operations/puppet/+/605898
 
=== 2020-06-15 ===
* 21:28 bstorm_: cleaned up killgridjobs.sh on the tools bastions [[phab:T157792|T157792]]
* 18:14 bd808: Rebuilding all Docker images to pick up webservice 0.71 ([[phab:T254640|T254640]], [[phab:T253412|T253412]])
* 18:12 bd808: Deploying webservice 0.71 to bastions and grid via clush
* 18:05 bd808: Building webservice 0.71
 
=== 2020-06-12 ===
* 13:13 arturo: live-hacking session in the puppetmaster ended
* 13:10 arturo: live-hacking puppet tree in tools-puppetmaster-02 to test a PAWS-related patch (they share haproxy puppet code)
* 00:16 bstorm_: remounted NFS for tools-k8s-control-3 and tools-acme-chief-01
 
=== 2020-06-11 ===
* 23:35 bstorm_: rebooting tools-k8s-control-2 because it seems to be confused on NFS, interestingly enough
 
=== 2020-06-04 ===
* 13:32 bd808: Manually restored /etc/haproxy/conf.d/elastic.cfg on tools-elastic-*
 
=== 2020-06-02 ===
* 12:23 arturo: renewed TLS cert for k8s metrics-server ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#internal_API_access
* 11:00 arturo: renewed TLS cert for prometheus to contact toolforge k8s ([[phab:T250874|T250874]]) following docs: https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes/Certificates#external_API_access
 
=== 2020-06-01 ===
* 23:51 bstorm_: refreshed certs for the custom webhook controllers on the k8s cluster [[phab:T250874|T250874]]
* 00:39 bd808: Ugh. Prior SAL message was about tools-sgeexec-0940
* 00:39 bd808: Compressed /var/log/account/pacct.0 ahead of rotation schedule to free some space on the root partition
 
=== 2020-05-29 ===
* 19:37 bstorm_: adding docker image for paws-public docker-registry.tools.wmflabs.org/paws-public-nginx:openresty [[phab:T252217|T252217]]
 
=== 2020-05-28 ===
* 21:19 bd808: Killed 7 python processes run by user 'mattho69' on login.toolforge.org
* 21:06 bstorm_: upgrading tools-k8s-worker-[30-60] to kubernetes 1.16.10 [[phab:T246122|T246122]]
* 17:54 bstorm_: upgraded tools-k8s-worker-[11..15] and starting on -21-29 now [[phab:T246122|T246122]]
* 16:01 bstorm_: kubectl upgraded to 1.16.10 on all bastions [[phab:T246122|T246122]]
* 15:58 arturo: upgrading tools-k8s-worker-[1..10] to 1.16.10 ([[phab:T246122|T246122]])
* 15:41 arturo: upgrading tools-k8s-control-3 to 1.16.10 ([[phab:T246122|T246122]])
* 15:17 arturo: upgrading tools-k8s-control-2 to 1.16.10 ([[phab:T246122|T246122]])
* 15:09 arturo: upgrading tools-k8s-control-1 to 1.16.10 ([[phab:T246122|T246122]])
* 14:49 arturo: cleanup /etc/apt/sources.list.d/ directory in all tools-k8s-* VMs
* 11:27 arturo: merging change to front-proxy: https://gerrit.wikimedia.org/r/c/operations/puppet/+/599139 ([[phab:T253816|T253816]])
 
=== 2020-05-27 ===
* 17:23 bstorm_: deleting "tools-k8s-worker-20", "tools-k8s-worker-19", "tools-k8s-worker-18", "tools-k8s-worker-17", "tools-k8s-worker-16"
 
=== 2020-05-26 ===
* 18:45 bstorm_: upgrading maintain-kubeusers to match what is in toolsbeta [[phab:T246059|T246059]] [[phab:T211096|T211096]]
* 16:20 bstorm_: fix incorrect volume name in kubeadm-config configmap [[phab:T246122|T246122]]
 
=== 2020-05-22 ===
* 20:00 bstorm_: rebooted tools-sgebastion-07 to clear up tmp file problems with 10 min warning
* 19:12 bstorm_: running command to delete over 2000 tmp ca certs on tools-bastion-07 [[phab:T253412|T253412]]
 
=== 2020-05-21 ===
* 22:40 bd808: Rebuilding all Docker containers for tools-webservice 0.70 ([[phab:T252700|T252700]])
* 22:36 bd808: Updated tools-webservice to 0.70 across instances ([[phab:T252700|T252700]])
* 22:29 bd808: Building tools-webservice 0.70 via wmcs-package-build.py
 
=== 2020-05-20 ===
* 09:59 arturo: now running tesseract-ocr v4.1.1-2~bpo9+1 in the Toolforge grid ([[phab:T247422|T247422]])
* 09:50 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'apt-get install tesseract-ocr -t stretch-backports -y'` ([[phab:T247422|T247422]])
* 09:35 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/sources.lists.d/kubeadm-k8s-component-repo.list ; rm /etc/apt/sources.list.d/repository_thirdparty-kubeadm-k8s-1-15.list ; run-puppet-agent'` ([[phab:T247422|T247422]])
* 09:23 arturo: `aborrero@cloud-cumin-01:~$ sudo cumin --force -x 'O<nowiki>{</nowiki>project:tools name:tools-sge[bcew].*<nowiki>}</nowiki>' 'rm /etc/apt/preferences.d/* ; run-puppet-agent'` ([[phab:T247422|T247422]])
 
=== 2020-05-19 ===
* 17:00 bstorm_: deleting/restarting the paws db-proxy pod because it cannot connect to the replicas...and I'm hoping that's due to depooling and such
 
=== 2020-05-13 ===
* 18:14 bstorm_: upgrading calico to 3.14.0 with typha enabled in Toolforge K8s [[phab:T250863|T250863]]
* 18:10 bstorm_: set "profile::toolforge::k8s::typha_enabled: true" in tools project for calico upgrade [[phab:T250863|T250863]]
 
=== 2020-05-09 ===
* 00:28 bstorm_: added nfs.* to ignored_fs_types for the prometheus::node_exporter params in project hiera [[phab:T252260|T252260]]
 
=== 2020-05-08 ===
* 18:17 bd808: Building all jessie-sssd derived images ([[phab:T197930|T197930]])
* 17:29 bd808: Building new jessie-sssd base image ([[phab:T197930|T197930]])
 
=== 2020-05-07 ===
* 21:51 bstorm_: rebuilding the docker images for Toolforge k8s
* 19:03 bstorm_: toollabs-webservice 0.69 now pushed to the Toolforge bastions
* 18:57 bstorm_: pushing new toollabs-webservice package v0.69 to the tools repos
 
=== 2020-05-06 ===
* 21:20 bd808: Kubectl delete node tools-k8s-worker-[16-20] ([[phab:T248702|T248702]])
* 18:24 bd808: Updated "profile::toolforge::k8s::worker_nodes" list in "tools-k8s-haproxy" prefix puppet ([[phab:T248702|T248702]])
* 18:14 bd808: Shutdown tools-k8s-worker-[16-20] instances ([[phab:T248702|T248702]])
* 18:04 bd808: Draining tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 17:56 bd808: Cordoned tools-k8s-worker-[16-20] in preparation for decomm ([[phab:T248702|T248702]])
* 00:01 bd808: Joining tools-k8s-worker-60 to the k8s worker pool
* 00:00 bd808: Joining tools-k8s-worker-59 to the k8s worker pool
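The 2020-05-06 entries above follow the usual cordon/drain/delete decommission flow; a minimal sketch (flags as in kubectl 1.16-era releases, node names from the log; the instance shutdown and haproxy worker-list update happened out of band):
<syntaxhighlight lang="bash">
# Cordon, drain, then remove each retired worker from the cluster.
for n in 16 17 18 19 20; do
    node="tools-k8s-worker-${n}"
    kubectl cordon "$node"                                         # stop new pods landing
    kubectl drain "$node" --ignore-daemonsets --delete-local-data  # evict running pods
    kubectl delete node "$node"                                    # drop the node object
done
</syntaxhighlight>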
 
=== 2020-05-05 ===
* 23:58 bd808: Joining tools-k8s-worker-58 to the k8s worker pool
* 23:55 bd808: Joining tools-k8s-worker-57 to the k8s worker pool
* 23:53 bd808: Joining tools-k8s-worker-56 to the k8s worker pool
* 21:51 bd808: Building 5 new k8s worker nodes ([[phab:T248702|T248702]])
 
=== 2020-05-04 ===
* 22:08 bstorm_: deleting tools-elastic-01/2/3 [[phab:T236606|T236606]]
* 16:46 arturo: removing the now unused `/etc/apt/preferences.d/toolforge_k8s_kubeadmrepo*` files ([[phab:T250866|T250866]])
* 16:43 arturo: removing the now unused `/etc/apt/sources.list.d/toolforge-k8s-kubeadmrepo.list` file ([[phab:T250866|T250866]])
 
=== 2020-04-29 ===
* 22:13 bstorm_: running a fixup script after fixing a bug [[phab:T247455|T247455]]
* 21:28 bstorm_: running the rewrite-psp-preset.sh script across all tools [[phab:T247455|T247455]]
* 16:54 bstorm_: deleted the maintain-kubeusers pod to start running the new image [[phab:T247455|T247455]]
* 16:52 bstorm_: tagged docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to latest to deploy to toolforge [[phab:T247455|T247455]]
 
=== 2020-04-28 ===
* 22:58 bstorm_: rebuilding docker-registry.tools.wmflabs.org/maintain-kubeusers:beta [[phab:T247455|T247455]]
 
=== 2020-04-23 ===
* 19:22 bd808: Increased Kubernetes services quota for bd808-test tool.
 
=== 2020-04-21 ===
* 23:06 bstorm_: repooled tools-k8s-worker-38/52, tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 22:09 bstorm_: depooling tools-sgewebgrid-lighttpd-0918/9 and tools-sgeexec-0901 [[phab:T250869|T250869]]
* 22:02 bstorm_: draining tools-k8s-worker-38 and tools-k8s-worker-52 as they are on the crashed host [[phab:T250869|T250869]]
 
=== 2020-04-20 ===
* 15:31 bd808: Rebuilding Docker containers to pick up tools-webservice v0.68 ([[phab:T250625|T250625]])
* 14:47 arturo: added joakino to tools.admin LDAP group
* 13:28 jeh: shutdown elasticsearch v5 cluster running Jessie [[phab:T236606|T236606]]
* 12:46 arturo: uploading tools-webservice v0.68 to aptly stretch-tools and update it on relevant servers ([[phab:T250625|T250625]])
* 12:06 arturo: uploaded tools-webservice v0.68 to stretch-toolsbeta for testing
* 11:59 arturo: `root@tools-sge-services-03:~# aptly db cleanup` removed 340 unreferenced packages, and 2 unreferenced files
 
=== 2020-04-15 ===
* 23:20 bd808: Building ruby25-sssd/base and children ([[phab:T141388|T141388]], [[phab:T250118|T250118]])
* 20:09 jeh: update default security group to allow prometheus01.metricsinfra.eqiad.wmflabs TCP 9100 [[phab:T250206|T250206]]
 
=== 2020-04-14 ===
* 18:26 bstorm_: Deployed new code and RBAC for maintain-kubeusers [[phab:T246123|T246123]]
* 18:19 bstorm_: updating the maintain-kubeusers:latest image [[phab:T246123|T246123]]
* 17:32 bstorm_: updating the maintain-kubeusers:beta image on tools-docker-imagebuilder-01 [[phab:T246123|T246123]]
 
=== 2020-04-10 ===
* 21:33 bd808: Rebuilding all Docker images for the Kubernetes cluster ([[phab:T249843|T249843]])
* 19:36 bstorm_: after testing, deploying toollabs-webservice 0.67 to tools repos [[phab:T249843|T249843]]
* 14:53 arturo: live-hacking tools-puppetmaster-02 with https://gerrit.wikimedia.org/r/c/operations/puppet/+/587991 for [[phab:T249837|T249837]]
 
=== 2020-04-09 ===
* 15:13 bd808: Rebuilding all stretch and buster Docker images. Jessie is broken at the moment due to package version mismatches
* 11:18 arturo: bump nproc limit in bastions https://gerrit.wikimedia.org/r/c/operations/puppet/+/587715 ([[phab:T219070|T219070]])
* 04:29 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 [try #2] ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 04:19 bd808: python3 build.py --image-prefix toolforge --tag latest --no-cache --push --single jessie-sssd
* 00:20 bd808: Docker rebuild failed in toolforge-python2-sssd-base: "zlib1g-dev : Depends: zlib1g (= 1:1.2.8.dfsg-2+b1) but 1:1.2.8.dfsg-2+deb8u1 is to be installed"
 
=== 2020-04-08 ===
* 23:49 bd808: Running rebuild_all for Docker images to pick up toollabs-webservice v0.66 ([[phab:T154504|T154504]], [[phab:T234617|T234617]])
* 23:35 bstorm_: deploy toollabs-webservice v0.66 [[phab:T154504|T154504]] [[phab:T234617|T234617]]
 
=== 2020-04-07 ===
* 20:06 andrewbogott: sss_cache -E on tools-sgebastion-08 and  tools-sgebastion-09
* 20:00 andrewbogott: sss_cache -E on tools-sgebastion-07
 
=== 2020-04-06 ===
* 19:16 bstorm_: deleted tools-redis-1001/2 [[phab:T248929|T248929]]
 
=== 2020-04-03 ===
* 22:40 bstorm_: shut down tools-redis-1001/2 [[phab:T248929|T248929]]
* 22:32 bstorm_: switch tools-redis-1003 to the active redis server [[phab:T248929|T248929]]
* 20:41 bstorm_: deleting tools-redis-1003/4 to attach them to an anti-affinity group [[phab:T248929|T248929]]
* 18:53 bstorm_: spin up tools-redis-1004 on stretch and connect to cluster [[phab:T248929|T248929]]
* 18:23 bstorm_: spin up tools-redis-1003 on stretch and connect to the cluster [[phab:T248929|T248929]]
* 16:50 bstorm_: launching tools-redis-03 (Buster) to see what happens
 
=== 2020-03-30 ===
* 18:28 bstorm_: Beginning rolling depool, remount, repool of k8s workers for [[phab:T248702|T248702]]
* 18:22 bstorm_: disabled puppet across tools-k8s-worker-[1-55].tools.eqiad.wmflabs [[phab:T248702|T248702]]
* 16:56 arturo: dropping `_psl.toolforge.org` TXT record ([[phab:T168677|T168677]])
 
=== 2020-03-27 ===
* 21:22 bstorm_: removed puppet prefix tools-docker-builder [[phab:T248703|T248703]]
* 21:15 bstorm_: deleted tools-docker-builder-06 [[phab:T248703|T248703]]
* 18:55 bstorm_: launching tools-docker-imagebuilder-01 [[phab:T248703|T248703]]
* 12:52 arturo: install python3-pykube on tools-k8s-control-3 for some tests interaction with the API from python
 
=== 2020-03-24 ===
* 11:44 arturo: trying to solve a rebase/merge conflict in labs/private.git in tools-puppetmaster-02
* 11:33 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]]) (second try with some additional bits in LUA)
* 10:16 arturo: merging tools-proxy patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/579952/ ([[phab:T234617|T234617]])
 
=== 2020-03-18 ===
* 19:07 bstorm_: removed role::toollabs::logging::sender from project puppet (it wouldn't work anyway)
* 18:04 bstorm_: removed puppet prefix tools-flannel-etcd [[phab:T246689|T246689]]
* 17:58 bstorm_: removed puppet prefix tools-worker [[phab:T246689|T246689]]
* 17:57 bstorm_: removed puppet prefix tools-k8s-master [[phab:T246689|T246689]]
* 17:36 bstorm_: removed lots of deprecated hiera keys from horizon for the old cluster [[phab:T246689|T246689]]
* 16:59 bstorm_: deleting "tools-worker-1002", "tools-worker-1001", "tools-k8s-master-01", "tools-flannel-etcd-03", "tools-k8s-etcd-03", "tools-flannel-etcd-02", "tools-k8s-etcd-02", "tools-flannel-etcd-01", "tools-k8s-etcd-01" [[phab:T246689|T246689]]
 
=== 2020-03-17 ===
* 13:29 arturo: set `profile::toolforge::bastion::nproc: 200` for tools-sgebastion-08 ([[phab:T219070|T219070]])
* 00:08 bstorm_: shut off tools-flannel-etcd-01/02/03 [[phab:T246689|T246689]]
 
=== 2020-03-16 ===
* 22:01 bstorm_: shut off tools-k8s-etcd-01/02/03 [[phab:T246689|T246689]]
* 22:00 bstorm_: shut off tools-k8s-master-01 [[phab:T246689|T246689]]
* 21:59 bstorm_: shut down tools-worker-1001 and tools-worker-1002 [[phab:T246689|T246689]]
 
=== 2020-03-11 ===
* 17:00 jeh: clean up apt cache on tools-sgebastion-07
 
=== 2020-03-06 ===
* 16:25 bstorm_: updating maintain-kubeusers image to filter invalid tool names
 
=== 2020-03-03 ===
* 18:16 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad1.wikimedia.cloud (eqiad1 subdomain change) [[phab:T236606|T236606]]
* 18:02 jeh: create OpenStack DNS record for elasticsearch.svc.tools.eqiad.wikimedia.cloud [[phab:T236606|T236606]]
* 17:31 jeh: create a OpenStack virtual ip address for the new elasticsearch cluster [[phab:T236606|T236606]]
* 10:54 arturo: deleted VMs `tools-worker-[1003-1020]` (legacy k8s cluster) ([[phab:T246689|T246689]])
* 10:51 arturo: cordoned/drained all legacy k8s worker nodes except 1001/1002 ([[phab:T246689|T246689]])
 
=== 2020-03-02 ===
* 22:26 jeh: starting first pass of elasticsearch data migration to new cluster [[phab:T236606|T236606]]
 
=== 2020-03-01 ===
* 01:48 bstorm_: old version of kubectl removed. Anyone who needs it can download it with `curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.4.12/bin/linux/amd64/kubectl`
* 01:27 bstorm_: running the force-migrate command to make sure any new kubernetes deployments are on the new cluster.
 
=== 2020-02-28 ===
* 22:14 bstorm_: shutting down the old maintain-kubeusers and taking the gloves off the new one (removing --gentle-mode)
* 16:51 bstorm_: node/tools-k8s-worker-15 uncordoned
* 16:44 bstorm_: drained tools-k8s-worker-15 and hard rebooting it because it wasn't happy
* 16:36 bstorm_: rebooting k8s workers 1-35 on the 2020 cluster to clear a strange nologin condition that has been there since the NFS maintenance
* 16:14 bstorm_: rebooted tools-k8s-worker-7 to clear some puppet issues
* 16:00 bd808: Devoicing stashbot in #wikimedia-cloud to reduce irc spam while migrating tools to 2020 Kubernetes cluster
* 15:28 jeh: create OpenStack server group tools-elastic with anti-affinty policy enabled [[phab:T236606|T236606]]
* 15:09 jeh: create 3 new elasticsearch VMs tools-elastic-[1,2,3] [[phab:T236606|T236606]]
* 14:20 jeh: create new puppet prefixes for existing (no change in data) and new elasticsearch VMs
* 04:35 bd808: Joined tools-k8s-worker-54 to 2020 Kubernetes cluster
* 04:34 bd808: Joined tools-k8s-worker-53 to 2020 Kubernetes cluster
* 04:32 bd808: Joined tools-k8s-worker-52 to 2020 Kubernetes cluster
* 04:31 bd808: Joined tools-k8s-worker-51 to 2020 Kubernetes cluster
* 04:28 bd808: Joined tools-k8s-worker-50 to 2020 Kubernetes cluster
* 04:24 bd808: Joined tools-k8s-worker-49 to 2020 Kubernetes cluster
* 04:23 bd808: Joined tools-k8s-worker-48 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-47 to 2020 Kubernetes cluster
* 04:21 bd808: Joined tools-k8s-worker-46 to 2020 Kubernetes cluster
* 04:19 bd808: Joined tools-k8s-worker-45 to 2020 Kubernetes cluster
* 04:14 bd808: Joined tools-k8s-worker-44 to 2020 Kubernetes cluster
* 04:13 bd808: Joined tools-k8s-worker-43 to 2020 Kubernetes cluster
* 04:12 bd808: Joined tools-k8s-worker-42 to 2020 Kubernetes cluster
* 04:10 bd808: Joined tools-k8s-worker-41 to 2020 Kubernetes cluster
* 04:09 bd808: Joined tools-k8s-worker-40 to 2020 Kubernetes cluster
* 04:08 bd808: Joined tools-k8s-worker-39 to 2020 Kubernetes cluster
* 04:07 bd808: Joined tools-k8s-worker-38 to 2020 Kubernetes cluster
* 04:06 bd808: Joined tools-k8s-worker-37 to 2020 Kubernetes cluster
* 03:49 bd808: Joined tools-k8s-worker-36 to 2020 Kubernetes cluster
* 00:50 bstorm_: rebuilt all docker images to include webservice 0.64
 
=== 2020-02-27 ===
* 23:27 bstorm_: installed toollabs-webservice 0.64 on the bastions
* 23:24 bstorm_: pushed toollabs-webservice version 0.64 to all toolforge repos
* 21:03 jeh: add reindex service account to elasticsearch for data migration [[phab:T236606|T236606]]
* 20:57 bstorm_: upgrading toollabs-webservice to stretch-toolsbeta version for jdk8:testing image only
* 20:19 jeh: update elasticsearch VPS security group to allow toolsbeta-elastic7-1 access on tcp 80 [[phab:T236606|T236606]]
* 18:53 bstorm_: hard rebooted a rather stuck tools-sgecron-01
* 18:20 bd808: Building tools-k8s-worker-[36-55]
* 17:56 bd808: Deleted instances tools-worker-10[21-40]
* 16:14 bd808: Decommissioning tools-worker-10[21-40]
* 16:02 bd808: Drained tools-worker-1021
* 15:51 bd808: Drained tools-worker-1022
* 15:44 bd808: Drained tools-worker-1023 (there is no tools-worker-1024)
* 15:39 bd808: Drained tools-worker-1025
* 15:39 bd808: Drained tools-worker-1026
* 15:11 bd808: Drained tools-worker-1027
* 15:09 bd808: Drained tools-worker-1028 (there is no tools-worker-1029)
* 15:07 bd808: Drained tools-worker-1030
* 15:06 bd808: Uncordoned tools-worker-10[16-20]. Was over-optimistic about repacking the legacy Kubernetes cluster into 15 instances. Will keep 20 for now.
* 15:00 bd808: Drained tools-worker-1031
* 14:54 bd808: Hard reboot tools-worker-1016. Direct virsh console unresponsive. Stuck in shutdown since 2020-01-22?
* 14:44 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 14:41 bd808: Drained tools-worker-1032
* 14:37 bd808: Drained tools-worker-1033
* 14:35 bd808: Drained tools-worker-1034
* 14:34 bd808: Drained tools-worker-1035
* 14:33 bd808: Drained tools-worker-1036
* 14:33 bd808: Drained tools-worker-10<nowiki>{</nowiki>39,38,37<nowiki>}</nowiki> yesterday but did not !log
* 00:29 bd808: Drained tools-worker-1009 for reboot (NFS flakey)
* 00:11 bd808: Uncordoned tools-worker-1009.tools.eqiad.wmflabs
* 00:08 bd808: Uncordoned tools-worker-1002.tools.eqiad.wmflabs
* 00:02 bd808: Rebooting tools-worker-1002
* 00:00 bd808: Draining tools-worker-1002 to reboot for NFS problems
 
=== 2020-02-26 ===
* 23:42 bd808: Drained tools-worker-1040
* 23:41 bd808: Cordoned tools-worker-10[16-40] in preparation for shrinking legacy Kubernetes cluster
* 23:12 bstorm_: replacing all tool limit-ranges in the 2020 cluster with a lower cpu request version
* 22:29 bstorm_: deleted pod maintain-kubeusers-6d9c45f4bc-5bqq5 to deploy new image
* 21:06 bstorm_: deleting loads of stuck grid jobs
* 20:27 jeh: rebooting tools-worker-[1008,1015,1021]
* 20:15 bstorm_: rebooting tools-sgegrid-master because it actually had the permissions thing going on still
* 18:03 bstorm_: downtimed toolschecker for nfs maintenance
 
=== 2020-02-25 ===
* 15:31 bd808: `wmcs-k8s-enable-cluster-monitor toolschecker`
 
=== 2020-02-23 ===
* 00:40 Krenair: [[phab:T245932|T245932]]
 
=== 2020-02-21 ===
* 16:02 andrewbogott: moving tools-sgecron-01 to cloudvirt1022
 
=== 2020-02-20 ===
* 14:49 andrewbogott: moving tools-k8s-worker-19 and tools-k8s-worker-18 to cloudvirt1022 (as part of draining 1014)
* 00:04 Krenair: Shut off tools-puppetmaster-01 - to be deleted in one week [[phab:T245365|T245365]]
 
=== 2020-02-19 ===
* 22:05 Krenair: Project-wide hiera change to swap puppetmaster to tools-puppetmaster-02 [[phab:T245365|T245365]]
* 15:36 bstorm_: setting 'puppetmaster: tools-puppetmaster-02.tools.eqiad.wmflabs' on tools-sgeexec-0942 to test new puppetmaster on grid [[phab:T245365|T245365]]
* 11:50 arturo: fix invalid yaml format in horizon puppet prefix 'tools-k8s-haproxy' that prevented clean puppet run in the VMs
* 00:59 bd808: Live hacked the "nginx-configuration" ConfigMap for [[phab:T245426|T245426]] (done several hours ago, but I forgot to !log it)
 
=== 2020-02-18 ===
* 23:26 bstorm_: added tools-sgegrid-master.tools.eqiad1.wikimedia.cloud and tools-sgegrid-shadow.tools.eqiad1.wikimedia.cloud to gridengine admin host lists
* 09:50 arturo: temporarily delete DNS zone tools.wmcloud.org to try re-creating it
 
=== 2020-02-17 ===
* 18:53 arturo: [[phab:T168677|T168677]] created DNS TXT record _psl.toolforge.org. with value `https://github.com/publicsuffix/list/pull/970`
* 13:22 arturo: relocating tools-sgewebgrid-lighttpd-0914 to cloudvirt1012 to spread same VMs across different hypervisors
 
=== 2020-02-14 ===
* 00:38 bd808: Added tools-k8s-worker-35 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:34 bd808: Added tools-k8s-worker-34 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:32 bd808: Added tools-k8s-worker-33 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:29 bd808: Added tools-k8s-worker-32 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-31 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:25 bd808: Added tools-k8s-worker-30 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:17 bd808: Added tools-k8s-worker-29 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:15 bd808: Added tools-k8s-worker-28 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:13 bd808: Added tools-k8s-worker-27 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:07 bd808: Added tools-k8s-worker-26 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 00:03 bd808: Added tools-k8s-worker-25 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
 
=== 2020-02-13 ===
* 23:53 bd808: Added tools-k8s-worker-24 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:50 bd808: Added tools-k8s-worker-23 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 23:38 bd808: Added tools-k8s-worker-22 to 2020 Kubernetes cluster ([[phab:T244791|T244791]])
* 21:35 bd808: Deleted tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> ([[phab:T244791|T244791]])
* 21:33 bd808: Removed tools-sgewebgrid-lighttpd-092<nowiki>{</nowiki>1,2,3,4,5,6,7,8<nowiki>}</nowiki> & tools-sgewebgrid-generic-090<nowiki>{</nowiki>3,4<nowiki>}</nowiki> from grid engine config ([[phab:T244791|T244791]])
* 17:43 andrewbogott: migrating b24e29d7-a468-4882-9652-9863c8acfb88 to cloudvirt1022
 
=== 2020-02-12 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice (0.63) ([[phab:T244954|T244954]])
* 19:15 bd808: Deployed toollabs-webservice (0.63) on bastions ([[phab:T244954|T244954]])
* 00:20 bd808: Depooling tools-sgewebgrid-generic-0903 ([[phab:T244791|T244791]])
* 00:19 bd808: Depooling tools-sgewebgrid-generic-0904 ([[phab:T244791|T244791]])
* 00:14 bd808: Depooling tools-sgewebgrid-lighttpd-0921 ([[phab:T244791|T244791]])
* 00:09 bd808: Depooling tools-sgewebgrid-lighttpd-0922 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0923 ([[phab:T244791|T244791]])
* 00:05 bd808: Depooling tools-sgewebgrid-lighttpd-0924 ([[phab:T244791|T244791]])
 
=== 2020-02-11 ===
* 23:58 bd808: Depooling tools-sgewebgrid-lighttpd-0925 ([[phab:T244791|T244791]])
* 23:56 bd808: Depooling tools-sgewebgrid-lighttpd-0926 ([[phab:T244791|T244791]])
* 23:38 bd808: Depooling tools-sgewebgrid-lighttpd-0927 ([[phab:T244791|T244791]])
 
=== 2020-02-10 ===
* 23:39 bstorm_: updated tools-manifest to 0.21 on aptly for stretch
* 22:51 bstorm_: all docker images now use webservice 0.62
* 22:01 bd808: Manually starting webservices for tools that were running on tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:47 bd808: Depooling tools-sgewebgrid-lighttpd-0928 ([[phab:T244791|T244791]])
* 21:25 bstorm_: upgraded toollabs-webservice package for tools to 0.62 [[phab:T244293|T244293]] [[phab:T244289|T244289]] [[phab:T234617|T234617]] [[phab:T156626|T156626]]
 
=== 2020-02-07 ===
* 10:55 arturo: drop jessie VM instances tools-prometheus-<nowiki>{</nowiki>01,02<nowiki>}</nowiki> which were shutdown ([[phab:T238096|T238096]])
 
=== 2020-02-06 ===
* 10:44 arturo: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/565556 which is a behavior change to the Toolforge front proxy ([[phab:T234617|T234617]])
* 10:27 arturo: shutdown again tools-prometheus-01, no longer in use ([[phab:T238096|T238096]])
* 05:07 andrewbogott: cleared out old /tmp and /var/log files on tools-sgebastion-07
 
=== 2020-02-05 ===
* 11:22 arturo: restarting ferm fleet-wide to account for prometheus servers changed IP (but same hostname) ([[phab:T238096|T238096]])
 
=== 2020-02-04 ===
* 11:38 arturo: start again tools-prometheus-01 again to sync data to the new tools-prometheus-03/04 VMs ([[phab:T238096|T238096]])
* 11:37 arturo: re-create tools-prometheus-03/04 as 'bigdisk2' instances (300GB) [[phab:T238096|T238096]]
 
=== 2020-02-03 ===
* 14:12 arturo: move tools-prometheus-04 from cloudvirt1022 to cloudvirt1013
* 12:48 arturo: shutdown tools-prometheus-01 and tools-prometheus-02, after fixing the proxy `tools-prometheus.wmflabs.org` to tools-prometheus-03, data synced ([[phab:T238096|T238096]])
* 09:38 arturo: tools-prometheus-01: systemctl stop prometheus@tools. Another try to migrate data to tools-prometheus-<nowiki>{</nowiki>03,04<nowiki>}</nowiki> ([[phab:T238096|T238096]])
 
=== 2020-01-31 ===
* 14:06 arturo: leave tools-prometheus-01 as the backend for tools-prometheus.wmflabs.org for the weekend so grafana dashboards keep working ([[phab:T238096|T238096]])
* 14:00 arturo: syncing again prometheus data from tools-prometheus-01 to tools-prometheus-0<nowiki>{</nowiki>3,4<nowiki>}</nowiki> due to some inconsistencies preventing prometheus from starting ([[phab:T238096|T238096]])
 
=== 2020-01-30 ===
* 21:04 andrewbogott: also apt-get install python3-novaclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 20:39 andrewbogott: apt-get install python3-keystoneclient  on tools-prometheus-03 and tools-prometheus-04 to suppress cronspam.  Possible real fix for this is https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/569084/
* 16:27 arturo: create VM tools-prometheus-04 as cold standby of tools-prometheus-03 ([[phab:T238096|T238096]])
* 16:25 arturo: point tools-prometheus.wmflabs.org proxy to tools-prometheus-03 ([[phab:T238096|T238096]])
* 13:42 arturo: disable puppet in prometheus servers while syncing metric data ([[phab:T238096|T238096]])
* 13:15 arturo: drop floating IP 185.15.56.60 and FQDN `prometheus.tools.wmcloud.org` because this is not how the prometheus setup is right now. Use a web proxy instead `tools-prometheus-new.wmflabs.org` ([[phab:T238096|T238096]])
* 13:09 arturo: created FQDN `prometheus.tools.wmcloud.org` pointing to IPv4 185.15.56.60 (tools-prometheus-03) to test [[phab:T238096|T238096]]
* 12:59 arturo: associated floating IPv4 185.15.56.60 to tools-prometheus-03 ([[phab:T238096|T238096]])
* 12:57 arturo: created domain `tools.wmcloud.org` in the tools project after some back and forth with Designate, permissions and the database. I plan to use this domain to test the new Debian Buster-based prometheus setup ([[phab:T238096|T238096]])
* 10:20 arturo: create new VM instance tools-prometheus-03 ([[phab:T238096|T238096]])
 
=== 2020-01-29 ===
* 20:07 bd808: Created <nowiki>{</nowiki>bastion,login,dev<nowiki>}</nowiki>.toolforge.org service names for Toolforge bastions using Horizon & Designate
 
=== 2020-01-28 ===
* 13:35 arturo: `aborrero@tools-clushmaster-02:~$ clush -w @exec-stretch 'for i in $(ps aux {{!}} grep [t]ools.j {{!}} awk -F" " "<nowiki>{</nowiki>print \$2<nowiki>}</nowiki>") ; do  echo "killing $i" ; sudo kill $i ; done {{!}}{{!}} true'` ([[phab:T243831|T243831]])
 
=== 2020-01-27 ===
* 07:05 zhuyifei1999_: wrong package. uninstalled. the correct one is bpfcc-tools and seems only available in buster+. [[phab:T115231|T115231]]
* 07:01 zhuyifei1999_: apt installing bcc on tools-worker-1037 to see who is sending SIGTERM, will uninstall after done. dependency: bin86. [[phab:T115231|T115231]]
 
=== 2020-01-24 ===
* 20:58 bd808: Built tools-k8s-worker-21 to test out build script following openstack client upgrade
* 15:45 bd808: Rebuilding all Docker containers again because I failed to actually update the build server git clone properly last time I did this
* 05:23 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster (take 2)
* 04:41 bd808: Rebuilding all Docker images to pick up webservice-python-bootstrap changes
 
=== 2020-01-23 ===
* 23:38 bd808: Halted tools-k8s-worker build script after first instance (tools-k8s-worker-10) stuck in "scheduling" state for 20 minutes
* 23:16 bd808: Building 6 new tools-k8s-worker instances for the 2020 Kubernetes cluster
* 05:15 bd808: Building tools-elastic-04
* 04:39 bd808: wmcs-openstack quota set --instances 192
* 04:36 bd808: wmcs-openstack quota set --cores 768 --ram {{Gerrit|1536000}}
 
=== 2020-01-22 ===
* 12:43 arturo: for the record, issue with tools-worker-1016 was memory exhaustion apparently
* 12:35 arturo: hard-reboot tools-worker-1016 (not responding to even console access)
 
=== 2020-01-21 ===
* 19:25 bstorm_: hard rebooting tools-sgeexec-0913/14/35 because they aren't even on the network
* 19:17 bstorm_: depooled and rebooted tools-sgeexec-0914 because it was acting funny
* 18:30 bstorm_: depooling and rebooting tools-sgeexec-[0911,0913,0919,0921,0924,0931,0933,0935,0939,0941].tools.eqiad.wmflabs (see the depool/repool sketch below)
* 17:21 bstorm_: rebooting toolschecker to recover stale nfs handle
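The depool/reboot/repool cycle logged above is roughly the following, a sketch assuming `exec-manage` takes a `depool` subcommand matching the `repool` usage recorded elsewhere in this log:
<syntaxhighlight lang="bash">
for host in tools-sgeexec-0911 tools-sgeexec-0913 tools-sgeexec-0919; do
    fqdn="${host}.tools.eqiad.wmflabs"
    sudo exec-manage depool "$fqdn"    # stop scheduling jobs on the node
    ssh "$fqdn" 'sudo reboot' || true  # the ssh session dies with the reboot
    # ...wait for the node to come back up, then:
    sudo exec-manage repool "$fqdn"
done
</syntaxhighlight>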
 
=== 2020-01-16 ===
* 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
* 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
* 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
* 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` [[phab:T242397|T242397]]
 
=== 2020-01-14 ===
* 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
* 02:23 andrewbogott: rebooting tools-paws-worker-1006  to resolve hangs associated with an old NFS failure
 
=== 2020-01-13 ===
* 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 ([[phab:T242642|T242642]])
* 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. [[phab:T242559|T242559]]
* 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. [[phab:T242559|T242559]]
* 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. [[phab:T242559|T242559]]
* 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. [[phab:T242559|T242559]]
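Each of the four entries above is the same cordon/fix/reboot sequence; a sketch using one of the workers named in the log:
<syntaxhighlight lang="bash">
kubectl cordon tools-k8s-worker-12   # keep new pods off the node
ssh tools-k8s-worker-12 'sudo puppet agent --test; sudo reboot'
# once the node is back and reports Ready (step not recorded in the log):
kubectl uncordon tools-k8s-worker-12
</syntaxhighlight>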
 
=== 2020-01-12 ===
* 22:31 Krenair: same on -13 and -14
* 22:28 Krenair: same on -8
* 22:18 Krenair: same on -7
* 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created
 
=== 2020-01-11 ===
* 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.
 
=== 2020-01-10 ===
* 23:31 bstorm_: updated toollabs-webservice package to 0.56
* 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
* 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
* 15:30 bstorm_: git stash-ing local puppet changes in the hope that arturo has that material locally and that doing so doesn't break anything
 
=== 2020-01-09 ===
* 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
* 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:06 bstorm_: rebooting tools-paws-master-01 [[phab:T242353|T242353]]
* 17:46 bstorm_: refreshing the paws cluster's entire x509 environment [[phab:T242353|T242353]]
 
=== 2020-01-07 ===
* 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
* 16:33 arturo: deleted by hand pod metrics/cadvisor-5pd46 due to prometheus having issues scraping it
* 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
* 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster [[phab:T242067|T242067]]
* 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` ([[phab:T241853|T241853]])
* 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 ([[phab:T241853|T241853]])
* 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 ([[phab:T241853|T241853]])
* 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace ([[phab:T241853|T241853]])
* 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
* 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
* 05:02 bd808: Creating tools-k8s-worker-[6-14]
* 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
* 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
* 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
* 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
 
=== 2020-01-06 ===
* 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
* 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
* 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
* 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
* 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
* 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the nfs volumes on tools-k8s-haproxy-1 (see the fstab sketch below) [[phab:T241908|T241908]]
* 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 [[phab:T241908|T241908]]
* 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix [[phab:T241908|T241908]]
* 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
* 16:42 bstorm_: failed sge-shadow-master back to the main grid master
* 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
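A sketch of the fstab surgery logged at 18:54 and 18:49, assuming the NFS entries are identifiable by filesystem type (review the matches before deleting):
<syntaxhighlight lang="bash">
# Drop NFS lines from fstab, keeping a backup (pattern is an assumption):
sudo sed -i.bak '/\snfs\s/d' /etc/fstab
# Unmount any NFS filesystems that are still mounted:
sudo umount -a -t nfs,nfs4
</syntaxhighlight>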
 
=== 2020-01-04 ===
* 18:11 bd808: Shutdown tools-worker-1029
* 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
* 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
* 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:16 bd808: Draining tools-worker-10<nowiki>{</nowiki>05,12,28<nowiki>}</nowiki> due to hardware errors ([[phab:T241884|T241884]])
* 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241873|T241873]])
 
=== 2020-01-03 ===
* 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
* 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 ([[phab:T237643|T237643]])
* 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for [[phab:T237643|T237643]]
* 03:04 bd808: Really rebuilding all <nowiki>{</nowiki>jessie,stretch,buster<nowiki>}</nowiki>-sssd images. Last time I forgot to actually update the git clone.
* 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox
 
=== 2020-01-02 ===
* 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox
 
=== 2019-12-30 ===
* 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for [[phab:T241523|T241523]]
* 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full
 
=== 2019-12-29 ===
* 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - [[phab:T241523|T241523]]
 
=== 2019-12-27 ===
* 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07
 
=== 2019-12-25 ===
* 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07
 
=== 2019-12-22 ===
* 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test ([[phab:T241310|T241310]])
* 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change ([[phab:T241310|T241310]])
 
=== 2019-12-20 ===
* 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"
* 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues
 
=== 2019-12-18 ===
* 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
* 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.
 
=== 2019-12-17 ===
* 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
* 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster [[phab:T234037|T234037]]
* 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster [[phab:T214513|T214513]] [[phab:T228499|T228499]]
* 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
* 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster [[phab:T214513|T214513]]
* 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster [[phab:T214513|T214513]] (more successfully this time)
* 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs (see the chattr sketch below) [[phab:T214513|T214513]]
* 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit [[phab:T214513|T214513]]
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
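The immutable-bit cleanup at 01:25 amounts to a `chattr -i` over the tool kubeconfigs; a sketch assuming the per-tool path `/data/project/<tool>/.kube/config` (layout is an assumption):
<syntaxhighlight lang="bash">
# Run as root so the glob can traverse every tool home:
sudo sh -c 'chattr -i /data/project/*/.kube/config'
</syntaxhighlight>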
 
=== 2019-12-16 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02 (see the CLI sketch below)
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster
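The 22:04 rule above, expressed as an OpenStack CLI sketch (the log does not say whether Horizon or the CLI was used):
<syntaxhighlight lang="bash">
openstack security group rule create \
    --ingress --protocol tcp --dst-port 25 --remote-ip 0.0.0.0/0 MTA
</syntaxhighlight>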
 
=== 2019-12-14 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).
 
=== 2019-12-13 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
 
=== 2019-12-12 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01, as it hadn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues
 
=== 2019-12-11 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031
 
=== 2019-12-10 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])
 
=== 2019-12-09 ===
* 11:06 andrewbogott: deleting unused security groups:  catgraph, devpi, MTA, mysql, syslog, test    [[phab:T91619|T91619]]
 
=== 2019-12-04 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use
 
=== 2019-11-29 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)
 
=== 2019-11-26 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up a container that was taking up 16G of disk space on tools-worker-1020 in order to re-run the puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config
 
=== 2019-11-25 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
 
=== 2019-11-22 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 05:55 jeh: add Riley Huntley `riley` to base tools project
 
=== 2019-11-21 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])
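A sketch of the standard kubeadm upgrade sequence behind the four entries above; package pins follow the upstream `-00` convention, and the exact flags used on the day may have differed:
<syntaxhighlight lang="bash">
# On every node, install the matching kubeadm first:
sudo apt-get install -y kubeadm=1.15.6-00

# On the first control node only:
sudo kubeadm upgrade apply v1.15.6

# Then upgrade the node components everywhere:
sudo apt-get install -y kubelet=1.15.6-00 kubectl=1.15.6-00
sudo systemctl restart kubelet
</syntaxhighlight>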
 
=== 2019-11-19 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-15 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]
 
=== 2019-11-13 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])
 
=== 2019-11-12 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]
 
=== 2019-11-10 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]
 
=== 2019-11-08 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"
 
=== 2019-11-07 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]
 
=== 2019-11-06 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 13:50 arturo: created 3 VMs`tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])
 
=== 2019-11-05 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]
 
=== 2019-11-04 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])
 
=== 2019-11-01 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]
 
=== 2019-10-31 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001.  Runaway logfiles filled up the drive which prevented puppet from running.  If puppet had run, it would have prevented the runaway logfiles.
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])
 
=== 2019-10-30 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]
 
=== 2019-10-29 ===
* 10:49 arturo: deleting VM tools-test-proxy-01, no longer in use
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]
 
=== 2019-10-28 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 15:54 arturo: tools-proxy-05 now has the 185.15.56.11 floating IP as the active proxy. The old one, 185.15.56.6, has been freed [[phab:T235627|T235627]]
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 15:16 arturo: tools-proxy-05 now has the 185.15.56.5 floating IP as the active proxy [[phab:T235627|T235627]]
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet profile ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])
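Each image upload above follows the usual pull/tag/push pattern; one entry as a sketch (run from a host with push access to the Tools registry):
<syntaxhighlight lang="bash">
docker pull k8s.gcr.io/kube-proxy:v1.15.1
docker tag k8s.gcr.io/kube-proxy:v1.15.1 \
    docker-registry.tools.wmflabs.org/kube-proxy:v1.15.1
docker push docker-registry.tools.wmflabs.org/kube-proxy:v1.15.1
</syntaxhighlight>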
 
=== 2019-10-24 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge
 
=== 2019-10-23 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting
 
=== 2019-10-22 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone
 
=== 2019-10-21 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46
 
=== 2019-10-18 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26} (see the qmod sketch below)
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])
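Clearing the error state recorded at 22:09 is done with gridengine's `qmod`; a sketch using one of the queue instances from that entry:
<syntaxhighlight lang="bash">
# -cq clears the error state of a queue instance:
sudo qmod -cq 'webgrid-generic@tools-sgewebgrid-generic-0901'
</syntaxhighlight>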
 
=== 2019-10-16 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
* 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)
 
=== 2019-10-15 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding
 
=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])
 
=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
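The three package uploads above follow aptly's add-then-publish flow; a sketch (the repo name comes from the log, the published distribution name is an assumption):
<syntaxhighlight lang="bash">
aptly repo add buster-tools kubernetes-node_1.4.6-7_amd64.deb
# Re-publish so the new package becomes visible to apt clients:
aptly publish update buster-tools
</syntaxhighlight>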
 
=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.
 
=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting
 
=== 2019-10-08 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.
 
=== 2019-10-07 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue (see the detection sketch at the end of this section)
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08
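A sketch of how stale mounts like the ones driving this reboot storm can be located from the clush master before rebooting; the node group and mount point are assumptions:
<syntaxhighlight lang="bash">
# A stat that hangs (here: times out) on the NFS mount is the usual
# symptom of a stale handle:
clush -w @all 'timeout 5 stat -t /data/project >/dev/null 2>&1 || echo STALE'
</syntaxhighlight>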
 
=== 2019-10-04 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated
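The 13:33 entry as a fleet-wide sketch, assuming it ran from the clush master with a node group covering the tools-worker nodes (group name is hypothetical):
<syntaxhighlight lang="bash">
# Remove the init script so the rsyslog prerm step cannot fail the upgrade:
clush -w @k8s-worker 'sudo rm -f /etc/init.d/rsyslog'
</syntaxhighlight>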
 
=== 2019-10-03 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required
 
=== 2019-09-27 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927
 
=== 2019-09-25 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021
 
=== 2019-09-23 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])
 
=== 2019-09-12 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use
 
=== 2019-09-11 ===
* 13:30 jeh: restart tools-sgeexec-0912
 
=== 2019-09-09 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038
 
=== 2019-09-06 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])
 
=== 2019-09-05 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)
 
=== 2019-09-01 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01
 
=== 2019-08-30 ===
* 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts
 
=== 2019-08-29 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use
 
=== 2019-08-27 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
 
=== 2019-08-26 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
 
=== 2019-08-18 ===
* 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01
 
=== 2019-08-17 ===
* 10:56 arturo: force-reboot tools-worker-1006. Is completely stuck
 
=== 2019-08-15 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces
 
=== 2019-08-13 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 13:41 jeh: Set icinga downtime for toolschecker labs showmount [[phab:T229448|T229448]]
 
=== 2019-08-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
 
=== 2019-08-08 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
 
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after an error from job 5190035
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled  tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to clean up sssd config and let nslcd/nscd start freshly
* 12:47 arturo: reboot tools-worker-1002 to clean up sssd config and let nslcd/nscd start freshly
* 12:42 arturo: reboot tools-worker-1001 to clean up sssd config and let nslcd/nscd start freshly
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]] and [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as the busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tool-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd: depool each sgeexec node, reboot, and repool (sketch below)
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
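A minimal sketch of the depool/reboot/repool cycle from the 09:03 entry, assuming ssh access from a grid admin host; the host list and the readiness wait are illustrative, while `exec-manage depool`/`repool` is the same tool logged elsewhere on this page:

<syntaxhighlight lang="bash">
# Controlled sssd rollover: drain a node, reboot it, put it back.
# Host list and wait loop are assumptions; only the depool/reboot/repool
# order and exec-manage come from the log.
for node in tools-sgeexec-0901.tools.eqiad.wmflabs \
            tools-sgeexec-0902.tools.eqiad.wmflabs; do
  sudo exec-manage depool "$node"            # stop scheduling jobs on the node
  ssh "$node" 'sudo reboot' || true          # reboot so sssd replaces nslcd
  until ssh -o ConnectTimeout=5 "$node" true 2>/dev/null; do
    sleep 10                                 # wait for the node to come back
  done
  sudo exec-manage repool "$node"            # accept jobs again
done
</syntaxhighlight>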
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] so the k8s node moves would register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22:00 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-sgewebgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-sgewebgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-sgewebgrid-lighttpd-0912
* 20:03 bstorm_: depooled tools-sgewebgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-sgewebgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master (written out below)
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
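The Eqw-clearing pipeline from the 17:25 entry, written out without the {{!}} template escaping:

<syntaxhighlight lang="bash">
# List every user's jobs, keep those stuck in error-wait (Eqw) state,
# extract the job ID column, and clear the error flag one job at a time.
qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 sudo qmod -cj
</syntaxhighlight>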
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to each tool's $HOME/crontab.trusty.save (sketch below)
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
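A minimal sketch of the crontab copy from the 15:27 entry; the cron spool path and the /data/project home layout are assumptions:

<syntaxhighlight lang="bash">
# Save each tool's remaining Trusty crontab into the tool's NFS home.
cd /var/spool/cron/crontabs
for f in tools.*; do
  tool=${f#tools.}                                   # strip the "tools." prefix
  cp "$f" "/data/project/${tool}/crontab.trusty.save"
done
</syntaxhighlight>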
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01{{!}}07{{!}}10)` all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg' and 'dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300 seconds
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was that /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct results
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work (consolidated below)
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
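Consolidating the 04:30-04:33 entries, the fix amounted to the following (a sketch; only the individual commands come from the log):

<syntaxhighlight lang="bash">
pkill nslcd                    # SIGTERM the stray background nslcd
rmdir /var/run/nslcd/socket    # the socket path had somehow become a directory
nslcd -nd                      # foreground debug run to confirm bind() succeeds
                               # (Ctrl-C once confirmed)
systemctl start nslcd          # then start it properly
id zhuyifei1999                # and verify LDAP lookups work again
</syntaxhighlight>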
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]] (written out below)
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
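The 15:16 grid-configurator invocation, written out without the {{!}} template escaping:

<syntaxhighlight lang="bash">
# Pull the observer password out of novaobserver.yaml (the awk strips the
# quotes from the second field) and pass it to grid-configurator.
sudo /usr/local/bin/grid-configurator --all-domains \
  --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml | awk '{gsub(/"/,"",$2);print $2}')
</syntaxhighlight>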
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools` which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
 
=== 2019-01-14 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 21:42 zhuyifei1999_: also sent them a `write` message (as root); auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
 
=== 2019-01-11 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
 
=== 2019-01-10 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
 
=== 2019-01-09 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
 
=== 2019-01-07 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
 
=== 2019-01-06 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
 
=== 2019-01-05 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
 
=== 2019-01-04 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
 
=== 2019-01-03 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01
 
=== 2018-12-21 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]
 
=== 2018-12-20 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002
 
=== 2018-12-17 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
 
=== 2018-12-11 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
 
=== 2018-12-05 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
 
=== 2018-12-04 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])
 
=== 2018-12-01 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
 
=== 2018-11-30 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])
 
=== 2018-11-27 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
 
=== 2018-11-26 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])
 
=== 2018-11-20 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 09:43 godog: restart prometheus@tools on prometheus-01
 
=== 2018-11-16 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 17:47 gtirloni: deleted tools-mail instance
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
 
=== 2018-11-14 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009
 
=== 2018-11-13 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
 
=== 2018-11-08 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])
 
=== 2018-11-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
 
=== 2018-11-02 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
 
=== 2018-10-31 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
 
=== 2018-10-29 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
 
=== 2018-10-26 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
 
=== 2018-10-19 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
 
=== 2018-10-18 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
 
=== 2018-10-16 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
 
=== 2018-10-07 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
 
=== 2018-09-21 ===
* 12:35 arturo: clean up stale apt preference files (pinning) in tools-clushmaster-01
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines
 
=== 2018-09-17 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
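The 09:13 queue purge, written out without the {{!}} template escaping:

<syntaxhighlight lang="bash">
# exiqgrep -i prints the ID of every queued message; exim -Mrm removes each one.
sudo exiqgrep -i | xargs sudo exim -Mrm
</syntaxhighlight>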
 
=== 2018-09-14 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API
 
=== 2018-09-08 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
 
=== 2018-09-07 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
 
=== 2018-08-27 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
 
=== 2018-08-22 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
 
=== 2018-08-19 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
 
=== 2018-08-14 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
 
=== 2018-08-13 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice
 
=== 2018-08-09 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
 
=== 2018-08-08 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
 
=== 2018-08-06 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
 
=== 2018-08-01 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
 
=== 2018-07-30 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
 
=== 2018-07-27 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
 
=== 2018-07-25 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
 
=== 2018-07-18 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]
 
=== 2018-06-30 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
 
=== 2018-06-29 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]
 
=== 2018-06-28 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
 
=== 2018-06-21 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-20 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
 
=== 2018-06-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-11 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
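The same one-liner without the {{!}} escaping:

<syntaxhighlight lang="bash">
# On every host: if exim's paniclog is non-empty, report it, remove it, and
# restart prometheus-node-exporter; "|| true" keeps clush happy on clean hosts.
clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 \
  && sudo rm -rf /var/log/exim4/paniclog \
  && sudo service prometheus-node-exporter restart || true'
</syntaxhighlight>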
 
=== 2018-06-08 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
 
=== 2018-06-07 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim panilog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
 
=== 2018-06-06 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]])
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220) (sketch below)
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
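A sketch of the scripted restarts from the 20:25/21:10/22:00 entries; the namespace-to-tool mapping and the sudo/webservice invocation are assumptions about the setup at the time:

<syntaxhighlight lang="bash">
# Find namespaces with pods stuck in CrashLoopBackOff and bounce each
# tool's webservice once.
kubectl get pods --all-namespaces \
  | awk '/CrashLoopBackOff/ {print $1}' | sort -u \
  | while read -r tool; do
      sudo -i -u "tools.${tool}" webservice restart
    done
</syntaxhighlight>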
 
=== 2018-06-05 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben ([[phab:T196486|T196486]])
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])
 
=== 2018-06-04 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
 
=== 2018-06-03 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' [[phab:T195834|T195834]]
 
=== 2018-05-31 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
 
=== 2018-05-30 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]
 
=== 2018-05-28 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
 
=== 2018-05-25 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
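A sketch of the logged edit; the sed is illustrative and the exact value strings in the file are assumptions, while the two value changes themselves come from the entry:

<syntaxhighlight lang="bash">
cd /data/project/.system/gridengine/default/common
sudo sed -i -e 's/h_vmem=256M/h_vmem=512M/' \
            -e 's/release=precise/release=trusty/' sge_request
</syntaxhighlight>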
 
=== 2018-05-22 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
 
=== 2018-05-18 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
 
=== 2018-05-16 ===
* 21:17 zhuyifei1999_: maintain-kubeusers stuck in an infinite loop of 10-second sleeps
 
=== 2018-05-15 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
 
=== 2018-05-12 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
 
=== 2018-05-11 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2018-05-10 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
 
=== 2018-05-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
 
=== 2018-05-07 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
 
=== 2018-05-05 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
 
=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
 
=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'` (written out below)
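The same pipeline without the {{!}} escaping. Note that, as logged, xargs hands every field of the matched lines to kill; the numeric PID column is what actually gets killed, and the other fields just produce errors:

<syntaxhighlight lang="bash">
# On every exec and webgrid node: wide process listing (user, PPID, PID,
# command), keep orphans (PPID 1), keep php-cgi, and SIGKILL them.
clush -w @exec -w @webgrid -b \
  'ps axwo user:20,ppid,pid,cmd | grep -E "    1 " | grep php-cgi | xargs sudo kill -9'
</syntaxhighlight>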
 
=== 2018-04-15 ===
* 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]
 
=== 2018-04-11 ===
* 13:25 chasemp: cleanup exim frozen messages in an effort to alleviate queue pressure
 
=== 2018-04-06 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for [[phab:T159254|T159254]]
 
=== 2018-04-05 ===
* 18:46 chicocvenancio: killed wget that was hogging io
 
=== 2018-03-29 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
 
=== 2018-03-28 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
 
=== 2018-03-26 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-23 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
 
=== 2018-03-22 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-21 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])
 
=== 2018-03-20 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
 
=== 2018-03-19 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
 
=== 2018-03-16 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
 
=== 2018-03-15 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
 
=== 2018-03-14 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
 
=== 2018-03-12 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages from jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers
 
=== 2018-03-08 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
 
=== 2018-03-07 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
 
=== 2018-03-06 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] on some servers there was a race on the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts stalled dpkg operations. Already solved, but some puppet alerts were produced (see the lock-fencing sketch at the end of this section)
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free up space in the filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this on the canary servers last week and it went fine, so running fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
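A sketch of fencing puppet off to avoid the dpkg lock race noted above, reusing the clush and apt-upgrade patterns from these entries (the disable message is illustrative):

<syntaxhighlight lang="bash">
# Stop puppet from grabbing the dpkg lock mid-upgrade...
clush -w @all "sudo puppet agent --disable 'apt upgrades in progress'"
# ...run the upgrade non-interactively...
clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-upgrade -u upgrade jessie-wikimedia -y"
# ...then re-enable puppet and force a run to confirm a clean state.
clush -w @all "sudo puppet agent --enable && sudo puppet agent --test"
</syntaxhighlight>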
 
=== 2018-03-05 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]
 
=== 2018-03-02 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
 
=== 2018-03-01 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
 
=== 2018-02-27 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 12:16 arturo: running `apt-get autoremove` in canary servers
 
=== 2018-02-26 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 10:35 arturo: enable puppet in tools-proxy-01
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
 
=== 2018-02-25 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
 
=== 2018-02-23 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
 
=== 2018-02-22 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
 
=== 2018-02-21 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tools-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, which leaked the creds of /data/project/strephit/.elasticsearch.ini. Might need to cycle it as well...
 
=== 2018-02-20 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:42 arturo: upgrading tools-k8s-etcd-01
 
=== 2018-02-19 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
 
=== 2018-02-16 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
 
=== 2018-02-15 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
 
=== 2018-02-14 ===
* 13:09 arturo: the reboot was OK, the server seems to be working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
 
=== 2018-02-11 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
 
=== 2018-02-09 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (see the qmod sketch below) ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
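A sketch of the rescheduling mentioned above, using gridengine's qhost/qmod (the job ID is illustrative; -rj asks the scheduler to restart a job elsewhere):

<syntaxhighlight lang="bash">
# List the jobs currently placed on the overloaded node.
qhost -j -h tools-webgrid-lighttpd-1421
# Reschedule a job away from it; repeat per job ID.
sudo qmod -rj 1234567
</syntaxhighlight>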
 
=== 2018-02-08 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
 
=== 2018-02-06 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 13:05 arturo: unpublish/publish trusty-tools repo (see the aptly sketch below)
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to the trusty-tools repo (self-contained)
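Roughly what that aptly unpublish/publish dance looks like; the repo name comes from the entries above, while the distribution argument and package filename are assumptions:

<syntaxhighlight lang="bash">
aptly repo add trusty-tools aptly_0.9.6-1_amd64.deb   # add the new package to the local repo
aptly publish drop trusty-tools                       # unpublish the current listing
aptly publish repo trusty-tools                       # publish the updated repo again
</syntaxhighlight>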
 
=== 2018-02-05 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
 
=== 2018-02-03 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
 
=== 2018-01-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
 
=== 2018-01-29 ===
* 20:02 chasemp: add zhuyifei1999_ tools root for  [[phab:T185577|T185577]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent
 
=== 2018-01-28 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03
 
=== 2018-01-27 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
 
=== 2018-01-25 ===
* 23:47 arturo: fixed the last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing configtimeout with http_configtimeout by hand in /etc/puppet/puppet.conf
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]
 
=== 2018-01-23 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: add me, arturo, chico to sudoers and removed marc
 
=== 2018-01-22 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a connection timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster
 
=== 2018-01-19 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
 
=== 2018-01-18 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-17 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling tools-exec-1430 via exec-manage
* 15:04 andrewbogott: depooling tools-exec-1430 via exec-manage. Experimenting with purge-old-kernels (see the sketch below)
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
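A sketch of that purge-old-kernels experiment (purge-old-kernels ships with the byobu package; extra flags are passed straight through to apt-get):

<syntaxhighlight lang="bash">
exec-manage depool tools-exec-1430    # take the node out of service first
sudo purge-old-kernels --keep 2 -qy   # keep the two newest kernels, assume yes
exec-manage repool tools-exec-1430    # repool once the cleanup is done
</syntaxhighlight>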
 
=== 2018-01-16 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail: cleared 719 pending messages for almouked@ltnet.net
 
=== 2018-01-11 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install Meltdown-patched kernel and reboot workers 1011-1016 as jessie pilot
 
=== 2018-01-10 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stale log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: emptied syslog and daemon.log by hand; they were so big that logrotate wouldn't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
 
=== 2018-01-09 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
 
=== 2018-01-08 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
 
=== 2018-01-06 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
 
=== 2018-01-05 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
 
=== 2018-01-04 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
 
=== 2018-01-03 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)
</noinclude>
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>

=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing an entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]

=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by the grid master and is failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs (