You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Nova Resource:Tools/SAL: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Stashbot
(zhuyifei1999_: forcing puppet run on tools-k8s-master-01)
imported>Stashbot
(lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # T318858)
(505 intermediate revisions by 2 users not shown)
Line 1: Line 1:
=== 2019-02-21 ===
=== 2022-09-28 ===
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)


=== 2019-02-20 ===
=== 2022-09-22 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442


=== 2019-02-19 ===
=== 2022-09-10 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko


=== 2019-02-18 ===
=== 2022-09-07 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness


=== 2019-02-17 ===
=== 2022-09-06 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever


=== 2019-02-16 ===
=== 2022-08-25 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd vis systemctl and `id zhuyifei1999` returns correct stuffs
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP


=== 2019-02-14 ===
=== 2022-08-24 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 12:20 taavi: upgrading ingress-nginx to v1.3
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r


=== 2019-02-13 ===
=== 2022-08-20 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07


=== 2019-02-12 ===
=== 2022-08-18 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
* 14:45 andrewbogott: adding lucaswerkmeister  as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair


=== 2019-02-11 ===
=== 2022-08-17 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1


=== 2019-02-08 ===
=== 2022-08-16 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools` which is also unavalaible
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07


=== 2019-02-07 ===
=== 2022-08-11 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]


=== 2019-02-04 ===
=== 2022-08-05 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06


=== 2019-01-30 ===
=== 2022-08-03 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station


=== 2019-01-25 ===
=== 2022-07-20 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]


=== 2019-01-24 ===
=== 2022-07-19 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest


=== 2019-01-23 ===
=== 2022-07-17 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])


=== 2019-01-22 ===
=== 2022-07-14 ===
* 20:21 gtirloni: published new docker images (all)
* 13:48 taavi: rebooting tools-sgeexec-10-2
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs


=== 2019-01-21 ===
=== 2022-07-13 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2019-01-18 ===
=== 2022-07-11 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon


=== 2019-01-17 ===
=== 2022-07-07 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04


=== 2019-01-16 ===
=== 2022-06-28 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04


=== 2019-01-15 ===
=== 2022-06-27 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]


=== 2019-01-14 ===
=== 2022-06-23 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 21:36 zhuyifei1999_: killed an egrep using too mush NFS bandwidth on tools-bastion-03
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]


=== 2019-01-11 ===
=== 2022-06-22 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2019-01-10 ===
=== 2022-06-21 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes tot he new grid
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2019-01-09 ===
=== 2022-06-03 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 15:50 balloons: fix fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]


=== 2019-01-07 ===
=== 2022-06-02 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2019-01-06 ===
=== 2022-06-01 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]


=== 2019-01-05 ===
=== 2022-05-31 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation


=== 2019-01-04 ===
=== 2022-05-30 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2019-01-03 ===
=== 2022-05-26 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01


=== 2018-12-21 ===
=== 2022-05-22 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuiliding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]


=== 2018-12-20 ===
=== 2022-05-16 ===
* 20:43 andrewbogott: moving moving tools-prometheus-02 to labvirt1004
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002


=== 2018-12-17 ===
=== 2022-05-14 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 10:47 taavi: hard reboot unresponsible tools-sgeexec-0940
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`


=== 2018-12-11 ===
=== 2022-05-12 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko


=== 2018-12-05 ===
=== 2022-05-10 ===
* 12:17 gtirloni: remoted node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2018-12-04 ===
=== 2022-05-06 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])


=== 2018-12-01 ===
=== 2022-05-05 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts


=== 2018-11-30 ===
=== 2022-05-03 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])


=== 2018-11-27 ===
=== 2022-05-02 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2018-11-26 ===
=== 2022-04-25 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 14:46 bd808: Building toolforge-webservice v0.82
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])


=== 2018-11-20 ===
=== 2022-04-23 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 09:43 godog: restart prometheus@tools on prometheus-01


=== 2018-11-16 ===
=== 2022-04-20 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 17:47 gtirloni: deleted tools-mail instance
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades


=== 2018-11-14 ===
=== 2022-04-16 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009


=== 2018-11-13 ===
=== 2022-04-12 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit


=== 2018-11-08 ===
=== 2022-04-10 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])


=== 2018-11-07 ===
=== 2022-04-09 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /


=== 2018-11-02 ===
=== 2022-04-08 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]


=== 2018-10-31 ===
=== 2022-04-05 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7


=== 2018-10-29 ===
=== 2022-04-04 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2018-10-26 ===
=== 2022-03-28 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo


=== 2018-10-19 ===
=== 2022-03-15 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)


=== 2018-10-18 ===
=== 2022-03-14 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo ([[phab:T297090|T297090]])


=== 2018-10-16 ===
=== 2022-03-10 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2018-10-07 ===
=== 2022-03-01 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
* 10:23 arturo: tools-sgeeex-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand


=== 2018-09-21 ===
=== 2022-02-28 ===
* 12:35 arturo: cleanup stalled apt preference files (pinning) in tools-clushmaster-01
* 08:02 taavi: reboot sgeexec-0916
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines


=== 2018-09-17 ===
=== 2022-02-17 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access


=== 2018-09-14 ===
=== 2022-02-16 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because is hammering the wikidata API
* 00:12 bd808: Image builds completed.
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because is hammering the wikidata API


=== 2018-09-08 ===
=== 2022-02-15 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2018-09-07 ===
=== 2022-02-10 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]


=== 2018-08-27 ===
=== 2022-02-09 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2018-08-22 ===
=== 2022-02-07 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]


=== 2018-08-19 ===
=== 2022-02-04 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes


=== 2018-08-14 ===
=== 2022-02-03 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate


=== 2018-08-13 ===
=== 2022-01-30 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice


=== 2018-08-09 ===
=== 2022-01-26 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes ([[phab:T277653|T277653]])


=== 2018-08-08 ===
=== 2022-01-25 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4


=== 2018-08-06 ===
=== 2022-01-24 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2018-08-01 ===
=== 2022-01-20 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2018-07-30 ===
=== 2022-01-19 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools


=== 2018-07-27 ===
=== 2022-01-14 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]


=== 2018-07-25 ===
=== 2022-01-12 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2018-07-18 ===
=== 2022-01-04 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]
 
=== 2018-06-30 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
 
=== 2018-06-29 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 17:11 bd808: Rescheduled jobs away from toole-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]
 
=== 2018-06-28 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
 
=== 2018-06-21 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-20 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
 
=== 2018-06-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-11 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
 
=== 2018-06-08 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
 
=== 2018-06-07 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim panilog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
 
=== 2018-06-06 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]])
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
 
=== 2018-06-05 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben ([[phab:T196486|T196486]])
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])
 
=== 2018-06-04 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
 
=== 2018-06-03 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and tools.mbh that has a job name starting 'comm_delin', 'delfilexcl' [[phab:T195834|T195834]]
 
=== 2018-05-31 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
 
=== 2018-05-30 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]
 
=== 2018-05-28 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
 
=== 2018-05-25 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
 
=== 2018-05-22 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
 
=== 2018-05-18 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
 
=== 2018-05-16 ===
* 21:17 zhuyifei1999_: maintain-kubeusers on stuck in infinite sleeps of 10 seconds
 
=== 2018-05-15 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
 
=== 2018-05-12 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
 
=== 2018-05-11 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2018-05-10 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
 
=== 2018-05-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
 
=== 2018-05-07 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
 
=== 2018-05-05 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
 
=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
 
=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Kill orphan php-cgi processes across the job grid via clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`
 
=== 2018-04-15 ===
* 17:51 zhuyifei1999_: forced puppet puns across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]
 
=== 2018-04-11 ===
* 13:25 chasemp: cleanup exim frozen messages in an effort to aleve queue pressure
 
=== 2018-04-06 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppemaster for [[phab:T159254|T159254]]
 
=== 2018-04-05 ===
* 18:46 chicocvenancio: killed wget that was hogging io
 
=== 2018-03-29 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
 
=== 2018-03-28 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
 
=== 2018-03-26 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-23 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
 
=== 2018-03-22 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-21 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])
 
=== 2018-03-20 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
 
=== 2018-03-19 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
 
=== 2018-03-16 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
 
=== 2018-03-15 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
 
=== 2018-03-14 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
 
=== 2018-03-12 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages form jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers
 
=== 2018-03-08 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
 
=== 2018-03-07 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
 
=== 2018-03-06 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did in canary servers last week and it went fine. So run in fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
 
=== 2018-03-05 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]
 
=== 2018-03-02 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
 
=== 2018-03-01 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
 
=== 2018-02-27 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 12:16 arturo: running `apt-get autoremove` in canary servers
 
=== 2018-02-26 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 10:35 arturo: enable puppet in tools-proxy-01
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
 
=== 2018-02-25 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
 
=== 2018-02-23 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
 
=== 2018-02-22 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
 
=== 2018-02-21 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lightttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...
 
=== 2018-02-20 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:42 arturo: upgrading tools-k8s-etcd-01
 
=== 2018-02-19 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
 
=== 2018-02-16 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
 
=== 2018-02-15 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
 
=== 2018-02-14 ===
* 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
 
=== 2018-02-11 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
 
=== 2018-02-09 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that were running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
 
=== 2018-02-08 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
 
=== 2018-02-06 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 13:05 arturo: unpublish/publish trusty-tools repo
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to trusty-tools repo (self contained)
 
=== 2018-02-05 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
 
=== 2018-02-03 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
 
=== 2018-01-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
 
=== 2018-01-29 ===
* 20:02 chasemp: add zhuyifei1999_ tools root for  [[phab:T185577|T185577]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent
 
=== 2018-01-28 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03
 
=== 2018-01-27 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
 
=== 2018-01-25 ===
* 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]
 
=== 2018-01-23 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: add me, arturo, chico to sudoers and removed marc
 
=== 2018-01-22 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster
 
=== 2018-01-19 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
 
=== 2018-01-18 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-17 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
* 15:04 andrewbogott: depooling exec-manage tools-exec-1430.  Experimenting with purge-old-kernels
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-16 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail  almouked@ltnet.net 719 pending messages cleared
 
=== 2018-01-11 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install metltdown kernel and reboot workers 1011-1016 as jessie pilot
 
=== 2018-01-10 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrorate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
 
=== 2018-01-09 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
 
=== 2018-01-08 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
 
=== 2018-01-06 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
 
=== 2018-01-05 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
 
=== 2018-01-04 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
 
=== 2018-01-03 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush


==Archives==
==Archives==
* [[/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)
</noinclude>
</noinclude>
{{SAL|Project Name=tools}}
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>
<noinclude>[[Category:SAL]]</noinclude>

Revision as of 21:23, 28 September 2022

2022-09-28

  • 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # T318858
  • 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, T318858
  • 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

2022-09-22

  • 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group T317438
  • 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy T317438

2022-09-10

  • 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

2022-09-07

  • 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks (T316854)

2022-09-06

  • 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder (T316854)

2022-08-25

  • 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner T293552

2022-08-24

2022-08-20

  • 07:44 dcaro_away: all k8s nodes ready now \o/ (T315718)
  • 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up (T315718)
  • 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking (T315718)

2022-08-18

  • 14:45 andrewbogott: adding lucaswerkmeister as projectadmin (T314527)
  • 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

2022-08-17

  • 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # T315459
  • 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

2022-08-16

  • 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05

2022-08-11

  • 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
  • 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

2022-08-05

  • 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-08-03

2022-07-20

  • 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
  • 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-07-19

  • 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest

2022-07-17

  • 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-07-14

  • 13:48 taavi: rebooting tools-sgeexec-10-2

2022-07-13

  • 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-07-11

  • 16:06 wm-bot2: Increased quotas by {self.increases} (T312692) - cookbook ran by nskaggs@x1carbon

2022-07-07

  • 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

2022-06-28

  • 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master (T311538) - cookbook ran by dcaro@vulcanus
  • 15:51 taavi: add 4096G cinder quota T311509

2022-06-27

  • 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
  • 18:02 taavi: switchover active cron server to tools-sgecron-2 T284767
  • 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:15 taavi: T311412 updating ca used by k8s-apiserver->etcd communication, breakage may happen
  • 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 T311412
  • 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it T311412

2022-06-23

  • 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 13:59 taavi: removing remaining continuous jobs from the stretch grid T277653

2022-06-22

  • 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-06-21

  • 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
  • 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

2022-06-03

  • 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor T309821
  • 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online T309821
  • 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
  • 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
  • 15:50 balloons: fix fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor T309821
  • 15:50 balloons: temp add 1.0G swap to sgeweblight hosts T309821
  • 15:50 balloons: fix fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
  • 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
  • 13:25 bd808: Upgrading fleet to tools-webservice 0.86 (T309821)
  • 13:20 bd808: publish tools-webservice 0.86 (T309821)
  • 12:46 taavi: start webservicemonitor on tools-sgecron-01 T309821
  • 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
  • 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid T309821
  • 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
  • 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package (T309821)
  • 03:10 bd808: publish tools-webservice 0.85 with hack for T309821

2022-06-02

  • 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
  • 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
  • 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
  • 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
  • 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
  • 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 12:03 dcaro: refresh prometheus certs (T308402)
  • 11:47 dcaro: refresh registry-admission-controller certs (T308402)
  • 11:42 dcaro: refresh ingress-admission-controller certs (T308402)
  • 11:36 dcaro: refresh volume-admission-controller certs (T308402)
  • 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
  • 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster T277653
  • 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
  • 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

2022-06-01

  • 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

2022-05-31

  • 16:51 taavi: delete tools-sgeexec-0904 for T309525 experimentation

2022-05-30

  • 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) T277653

2022-05-26

2022-05-22

  • 17:04 taavi: failover tools-redis to the updated cluster T278541
  • 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud (T308982) - cookbook ran by taavi@runko

2022-05-16

2022-05-14

  • 10:47 taavi: hard reboot unresponsible tools-sgeexec-0940

2022-05-12

2022-05-10

  • 15:18 taavi: depool tools-k8s-worker-42 for experiments
  • 13:54 taavi: enable distro-wikimedia unattended upgrades T290494

2022-05-06

  • 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl (T307812)

2022-05-05

  • 17:28 taavi: deploy tools-webservice 0.83 T307693

2022-05-03

  • 08:20 taavi: redis: start replication from the old cluster to the new one (T278541)

2022-05-02

  • 08:54 taavi: restart acme-chief.service T307333

2022-04-25

  • 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 (T214343)
  • 14:46 bd808: Building toolforge-webservice v0.82

2022-04-23

  • 16:51 bd808: Built new perl532-sssd/{base,web} images and pushed to registry (T214343)

2022-04-20

2022-04-16

2022-04-12

  • 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' (T305986)
  • 21:27 bd808: Added komla to 'roots' sudoers policy (T305986)
  • 21:24 bd808: Add komla as projectadmin (T305986)

2022-04-10

  • 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)

2022-04-09

  • 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

2022-04-08

  • 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

2022-04-05

2022-04-04

2022-03-28

  • 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo

2022-03-15

2022-03-14

  • 11:44 arturo: deploy jobs-framework-emailer 9470a5f (T286135)
  • 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo (T297090)

2022-03-10

  • 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

2022-03-01

  • 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state (T302702)
  • 12:11 dcaro: Cleared error state queues for sgeexec-0916 (T302702)
  • 10:23 arturo: tools-sgeeex-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

2022-02-28

  • 08:02 taavi: reboot sgeexec-0916
  • 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

2022-02-17

  • 08:23 taavi: deleted tools-clushmaster-02
  • 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

2022-02-16

  • 00:12 bd808: Image builds completed.

2022-02-15

  • 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
  • 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
  • 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
  • 22:50 bd808: Built new toollabs-webservice 0.81
  • 18:43 bd808: Enabled puppet on tools-proxy-05
  • 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
  • 18:21 taavi: delete tools-package-builder-03
  • 11:49 arturo: invalidate sssd cache in all bastions to debug T301736
  • 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for T301736
  • 11:15 arturo: reboot tools-sgebastion-10 for T301736

2022-02-10

  • 15:07 taavi: shutdown tools-clushmaster-02 T298191
  • 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
  • 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally T214427
  • 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - T214427
  • 08:06 taavi: disable puppet globally for enabling puppetdb T214427

2022-02-09

  • 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet T214427
  • 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 18:25 arturo: ignore last message
  • 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
  • 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 T298191

2022-02-07

  • 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository (T288406)
  • 12:52 taavi: updated maintain-kubeusers for T301081

2022-02-04

  • 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
  • 21:36 taavi: clear error state from some webgrid nodes

2022-02-03

  • 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
  • 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

2022-01-30

  • 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover T278541
  • 14:22 taavi: creating a cluster of 3 bullseye redis hosts for T278541

2022-01-26

  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
  • 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
  • 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
  • 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
  • 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
  • 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
  • 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
  • 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes (T277653)

2022-01-25

  • 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 11:44 arturo: rebooting buster exec nodes
  • 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

2022-01-24

  • 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
  • 15:23 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-20

  • 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
  • 12:56 arturo: scaling up the grid with 10 buster exec nodes (T277653)

2022-01-19

  • 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

2022-01-14

  • 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, T299243

2022-01-12

  • 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
  • 11:03 arturo: created puppet prefix 'tools-sgeweblig'
  • 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

2022-01-04

  • 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
  • 08:12 taavi: disable puppet & exim4 on T298501

Archives