Nova Resource:Tools/SAL: Difference between revisions

From Wikitech-static
(348 intermediate revisions by 2 users not shown)
=== 2022-09-28 ===
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

=== 2020-01-16 ===
* 23:54 bstorm_: rebooting tools-docker-builder-06 because there are a couple running containers that don't want to die cleanly
* 23:45 bstorm_: rebuilding docker containers to include new webservice version (0.58)
* 23:41 bstorm_: deployed toollabs-webservice 0.58 to everything that isn't a container
* 16:45 bstorm_: ran configurator to set the gridengine web queues to `rerun FALSE` [[phab:T242397|T242397]]


=== 2022-09-22 ===
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]

=== 2020-01-14 ===
* 15:29 bstorm_: failed the gridengine master back to the master server from the shadow
* 02:23 andrewbogott: rebooting tools-paws-worker-1006 to resolve hangs associated with an old NFS failure


=== 2022-09-10 ===
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

=== 2020-01-13 ===
* 17:48 bd808: Running `puppet ca destroy` for each unsigned cert on tools-puppetmaster-01 ([[phab:T242642|T242642]])
* 16:42 bd808: Cordoned and fixed puppet on tools-k8s-worker-12. Rebooting now. [[phab:T242559|T242559]]
* 16:33 bd808: Cordoned and fixed puppet on tools-k8s-worker-11. Rebooting now. [[phab:T242559|T242559]]
* 16:31 bd808: Cordoned and fixed puppet on tools-k8s-worker-10. Rebooting now. [[phab:T242559|T242559]]
* 16:26 bd808: Cordoned and fixed puppet on tools-k8s-worker-9. Rebooting now. [[phab:T242559|T242559]]
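The cordon/fix-puppet/reboot sequence logged above can be sketched as a dry-run loop. The exact commands used are not recorded, so this is an assumption; node names are taken from the entries:

```shell
# Dry-run sketch (assumed commands): print the per-node maintenance sequence
plan_node_maintenance() {
  local n="$1"
  printf '%s\n' \
    "kubectl cordon ${n}" \
    "ssh ${n} sudo run-puppet-agent" \
    "ssh ${n} sudo reboot"
}

for n in tools-k8s-worker-{9..12}; do
  plan_node_maintenance "$n"
done
```

Printing the plan before running it lets the operator eyeball the node list, which is how mistakes like cordoning the wrong worker usually get caught.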


=== 2022-09-07 ===
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])

=== 2020-01-12 ===
* 22:31 Krenair: same on -13 and -14
* 22:28 Krenair: same on -8
* 22:18 Krenair: same on -7
* 22:11 Krenair: Did usual new instance creation puppet dance on tools-k8s-worker-6, /data/project got created


=== 2022-09-06 ===
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])

=== 2020-01-11 ===
* 01:33 bstorm_: updated toollabs-webservice package to 0.57, which should allow persisting mem and cpu in manifests with burstable qos.


=== 2022-08-25 ===
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]

=== 2020-01-10 ===
* 23:31 bstorm_: updated toollabs-webservice package to 0.56
* 15:45 bstorm_: depooled tools-paws-worker-1013 to reboot because I think it is the last tools server with that mount issue (I hope)
* 15:35 bstorm_: depooling and rebooting tools-worker-1016 because it still had the leftover mount problems
* 15:30 bstorm_: git stash-ing local puppet changes in hopes that arturo has that material locally, and it doesn't break anything to do so


=== 2022-08-24 ===
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 12:20 taavi: upgrading ingress-nginx to v1.3

=== 2020-01-09 ===
* 23:35 bstorm_: depooled tools-sgeexec-0939 because it isn't acting right and rebooting it
* 18:26 bstorm_: re-joining the k8s nodes OF THE PAWS CLUSTER to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:25 bstorm_: re-joining the k8s nodes to the cluster one at a time to rotate the certs [[phab:T242353|T242353]]
* 18:06 bstorm_: rebooting tools-paws-master-01 [[phab:T242353|T242353]]
* 17:46 bstorm_: refreshing the paws cluster's entire x509 environment [[phab:T242353|T242353]]


=== 2022-08-20 ===
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])

=== 2020-01-07 ===
* 22:40 bstorm_: rebooted tools-worker-1007 to recover it from disk full and general badness
* 16:33 arturo: deleted by hand pod metrics/cadvisor-5pd46 due to prometheus having issues scraping it
* 15:46 bd808: Rebooting tools-k8s-worker-[6-14]
* 15:35 bstorm_: changed kubeadm-config to use a list instead of a hash for extravols on the apiserver in the new k8s cluster [[phab:T242067|T242067]]
* 14:02 arturo: `root@tools-k8s-control-3:~# wmcs-k8s-secret-for-cert -n metrics -s metrics-server-certs -a metrics-server` ([[phab:T241853|T241853]])
* 13:33 arturo: upload docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0 copied from quay.io/coreos/kube-state-metrics:v1.8.0 ([[phab:T241853|T241853]])
* 13:31 arturo: upload docker-registry.tools.wmflabs.org/metrics-server-amd64:v0.3.6 copied from k8s.gcr.io/metrics-server-amd64:v0.3.6 ([[phab:T241853|T241853]])
* 13:23 arturo: [new k8s] doing changes to kube-state-metrics and metrics-server trying to relocate them to the 'metrics' namespace ([[phab:T241853|T241853]])
* 05:28 bd808: Creating tools-k8s-worker-[6-14] (again)
* 05:20 bd808: Deleting busted tools-k8s-worker-[6-14]
* 05:02 bd808: Creating tools-k8s-worker-[6-14]
* 00:26 bstorm_: repooled tools-sgewebgrid-lighttpd-0919
* 00:17 bstorm_: repooled tools-sgewebgrid-lighttpd-0918
* 00:15 bstorm_: moving tools-sgewebgrid-lighttpd-0918 and -0919 to cloudvirt1004 from cloudvirt1029 to rebalance load
* 00:02 bstorm_: depooled tools-sgewebgrid-lighttpd-0918 and 0919 to move to cloudvirt1004 to improve spread
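The 13:33 and 13:31 image uploads follow the standard pull/tag/push pattern for mirroring an upstream image into a private registry. A dry-run sketch (the real uploads may have used different tooling):

```shell
# Sketch: mirror an upstream image into the Toolforge registry.
# Printed as a dry run rather than executed.
mirror_image() {
  local src="$1" dst="$2"
  printf 'docker pull %s\ndocker tag %s %s\ndocker push %s\n' \
    "$src" "$src" "$dst" "$dst"
}

mirror_image quay.io/coreos/kube-state-metrics:v1.8.0 \
             docker-registry.tools.wmflabs.org/coreos/kube-state-metrics:v1.8.0
```

Mirroring into the local registry avoids having cluster nodes pull directly from quay.io or k8s.gcr.io at deploy time.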


=== 2022-08-18 ===
* 14:45 andrewbogott: adding lucaswerkmeister as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

=== 2020-01-06 ===
* 23:40 bd808: Deleted tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:36 bd808: Shutdown tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:34 bd808: Decommissioned tools-sgewebgrid-lighttpd-09<nowiki>{</nowiki>0[1-9],10<nowiki>}</nowiki>
* 23:13 bstorm_: Repooled tools-sgeexec-0922 because I don't know why it was depooled
* 23:01 bd808: Depooled tools-sgewebgrid-lighttpd-0910.tools.eqiad.wmflabs
* 22:58 bd808: Depooling tools-sgewebgrid-lighttpd-090[2-9]
* 22:57 bd808: Disabling queues on tools-sgewebgrid-lighttpd-090[2-9]
* 21:07 bd808: Restarted kube2proxy on tools-proxy-05 to try and refresh admin tool's routes
* 18:54 bstorm_: edited /etc/fstab to remove NFS and unmounted the nfs volumes tools-k8s-haproxy-1 [[phab:T241908|T241908]]
* 18:49 bstorm_: edited /etc/fstab to remove NFS and rebooted to clear stale mounts on tools-k8s-haproxy-2 [[phab:T241908|T241908]]
* 18:47 bstorm_: added mount_nfs=false to tools-k8s-haproxy puppet prefix [[phab:T241908|T241908]]
* 18:24 bd808: Deleted shutdown instance tools-worker-1029 (was an SSSD testing instance)
* 16:42 bstorm_: failed sge-shadow-master back to the main grid master
* 16:42 bstorm_: Removed files for old S1tty that wasn't working on sge-grid-master
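The queue disabling logged at 22:57 maps onto gridengine's `qmod -d` with a `queue@host` selector; a dry-run sketch (any wrapper scripts actually used are not shown, so the literal command is an assumption):

```shell
# Dry-run sketch: disable every queue instance on a grid host before depooling it.
# qmod -d "*@host" is standard (Sun/Son of) Grid Engine syntax.
disable_queues() {
  printf 'qmod -d "*@%s"\n' "$1"
}

for h in tools-sgewebgrid-lighttpd-090{2..9}; do
  disable_queues "$h"
done
```

Disabling the queues first stops new jobs landing on the host while existing jobs drain, which is why it precedes the 22:58 depool in the log.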


=== 2022-08-17 ===
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

=== 2020-01-04 ===
* 18:11 bd808: Shutdown tools-worker-1029
* 18:10 bd808: kubectl delete node tools-worker-1029.tools.eqiad.wmflabs
* 18:06 bd808: Removed tools-worker-1029.tools.eqiad.wmflabs from k8s::worker_hosts hiera in preparation for decom
* 16:54 bstorm_: moving VMs tools-worker-1012/1028/1005 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:47 bstorm_: moving VM tools-flannel-etcd-02 from cloudvirt1024 to cloudvirt1003 due to hardware errors [[phab:T241884|T241884]]
* 16:16 bd808: Draining tools-worker-10<nowiki>{</nowiki>05,12,28<nowiki>}</nowiki> due to hardware errors ([[phab:T241884|T241884]])
* 16:13 arturo: moving VM tools-sgewebgrid-lighttpd-0927 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:11 arturo: moving VM tools-sgewebgrid-lighttpd-0926 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:09 arturo: moving VM tools-sgewebgrid-lighttpd-0925 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:08 arturo: moving VM tools-sgewebgrid-lighttpd-0924 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:07 arturo: moving VM tools-sgewebgrid-lighttpd-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:06 arturo: moving VM tools-sgewebgrid-lighttpd-0909 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:04 arturo: moving VM tools-sgeexec-0923 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241884|T241884]])
* 16:02 arturo: moving VM tools-sgeexec-0910 from cloudvirt1024 to cloudvirt1009 due to hardware errors ([[phab:T241873|T241873]])
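The per-VM moves off the failing hypervisor would today be expressed through the OpenStack CLI; a hypothetical dry-run sketch, assuming recent python-openstackclient flags (`--live-migration`, `--host`) rather than whatever tooling was actually used in 2020:

```shell
# Hypothetical sketch: plan live migrations away from a failing hypervisor.
plan_migration() {
  printf 'openstack server migrate --live-migration --host %s %s\n' "$1" "$2"
}

for vm in tools-sgewebgrid-lighttpd-09{09,23,24,25,26,27}; do
  plan_migration cloudvirt1009 "$vm"
done
```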


=== 2022-08-16 ===
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05

=== 2020-01-03 ===
* 16:48 bstorm_: updated the ValidatingWebhookConfiguration for the ingress admission controller to the working settings
* 11:51 arturo: [new k8s] deploy cadvisor as in https://gerrit.wikimedia.org/r/c/operations/puppet/+/561654 ([[phab:T237643|T237643]])
* 11:21 arturo: upload k8s.gcr.io/cadvisor:v0.30.2 docker image to the docker registry as docker-registry.tools.wmflabs.org/cadvisor:0.30.2 for [[phab:T237643|T237643]]
* 03:04 bd808: Really rebuilding all <nowiki>{</nowiki>jessie,stretch,buster<nowiki>}</nowiki>-sssd images. Last time I forgot to actually update the git clone.
* 00:11 bd808: Rebuilding all stretch-sssd Docker images to pick up busybox


=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

=== 2020-01-02 ===
* 23:54 bd808: Rebuilding all buster-sssd Docker images to pick up busybox


=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2019-12-30 ===
* 05:02 andrewbogott: moving tools-worker-1012 to cloudvirt1024 for [[phab:T241523|T241523]]
* 04:49 andrewbogott: draining and rebooting tools-worker-1031, its drive is full


=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station

=== 2019-12-29 ===
* 01:38 Krenair: Cordoned tools-worker-1012 and deleted pods associated with dplbot and dewikigreetbot as well as my own testing one, host seems to be under heavy load - [[phab:T241523|T241523]]


=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2019-12-27 ===
* 15:06 Krenair: Killed a "python parse_page.py outreachy" process by aikochou that was hogging IO on tools-sgebastion-07


=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest

=== 2019-12-25 ===
* 16:07 zhuyifei1999_: pkilled 5 `python pwb.py` processes belonging to `tools.kaleem-bot` on tools-sgebastion-07


=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2019-12-22 ===
* 20:13 bd808: Enabled Puppet on tools-proxy-06.tools.eqiad.wmflabs after nginx config test ([[phab:T241310|T241310]])
* 18:52 bd808: Disabled Puppet on tools-proxy-06.tools.eqiad.wmflabs to test nginx config change ([[phab:T241310|T241310]])


=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2
* 11:33 arturo: reboot tools-k8s-control-3 to fix some stale NFS mount issues

=== 2019-12-20 ===
* 22:28 bd808: Re-enabled Puppet on tools-sgebastion-09. Reason for disable was "arturo raising systemd limits"


=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2019-12-18 ===
* 17:33 bstorm_: updated package in aptly for toollabs-webservice to 0.53
* 11:49 arturo: introduce placeholder DNS records for toolforge.org domain. No services are provided under this domain yet for end users, this is just us testing (SSL, proxy stuff etc). This may be reverted anytime.


=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon

=== 2019-12-17 ===
* 20:25 bd808: Fixed https://tools.wmflabs.org/ to redirect to https://tools.wmflabs.org/admin/
* 19:21 bstorm_: deployed the changes to the live proxy to enable the new kubernetes cluster [[phab:T234037|T234037]]
* 16:53 bstorm_: maintain-kubeusers app deployed fully in tools for new kubernetes cluster [[phab:T214513|T214513]] [[phab:T228499|T228499]]
* 16:50 bstorm_: updated the maintain-kubeusers docker image for beta and tools
* 04:48 bstorm_: completed first run of maintain-kubeusers 2 in the new cluster [[phab:T214513|T214513]]
* 01:26 bstorm_: running the first run of maintain-kubeusers 2.0 for the new cluster [[phab:T214513|T214513]] (more successfully this time)
* 01:25 bstorm_: unset the immutable bit from 1704 tool kubeconfigs [[phab:T214513|T214513]]
* 01:05 bstorm_: beginning the first run of the new maintain-kubeusers in gentle-mode -- but it was just killed by some files setting the immutable bit [[phab:T214513|T214513]]
* 00:45 bstorm_: enabled encryption at rest on the new k8s cluster
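The 01:25 immutable-bit cleanup corresponds to `chattr -i` on each tool's kubeconfig. A sketch under assumptions: the tool names below are placeholders, and the `/data/project/<tool>/.kube/config` path follows the usual Toolforge layout:

```shell
# Sketch: clear the immutable bit so maintain-kubeusers can rewrite kubeconfigs.
# Tool names are placeholders, not the actual 1704 affected tools.
clear_immutable() {
  printf 'chattr -i /data/project/%s/.kube/config\n' "$1"
}

for tool in admin zoomviewer wikibugs; do
  clear_immutable "$tool"
done
```

A file with the immutable attribute set cannot be modified even by root, which is why it killed the first maintain-kubeusers run at 01:05.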


=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2019-12-16 ===
* 22:04 bd808: Added 'ALLOW IPv4 25/tcp from 0.0.0.0/0' to "MTA" security group applied to tools-mail-02
* 19:05 bstorm_: deployed the maintain-kubeusers operations pod to the new cluster


=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]

=== 2019-12-14 ===
* 10:48 valhallasw`cloud: re-enabling puppet on tools-sgeexec-0912, likely left-over from NFS maintenance (no reason was specified).


=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]

=== 2019-12-13 ===
* 18:46 bstorm_: updated tools-k8s-control-2 and 3 to the new config as well
* 17:56 bstorm_: updated tools-k8s-control-1 to the new control plane configuration
* 17:47 bstorm_: edited kubeadm-config configMap object to match the new init config
* 17:32 bstorm_: rebooting tools-k8s-control-2 to correct mount issue
* 00:45 bstorm_: rebooting tools-static-13
* 00:28 bstorm_: rebooting the k8s master to clear NFS errors
* 00:15 bstorm_: switch tools-acme-chief config to match the new authdns_servers format upstream
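The pre-maintenance Puppet CA backup in this section records a destination path; the archiving command itself is not logged, so this is a sketch of one plausible way to produce that tarball:

```shell
# Sketch (assumed command): archive the puppetmaster CA state before touching it.
# The log only records the resulting path, /root/puppet-ca-backup-2022-06-27.tar.gz.
backup_cmd() {
  printf 'tar -czf /root/puppet-ca-backup-%s.tar.gz -C /var/lib/puppet server\n' "$1"
}

backup_cmd 2022-06-27
```

Taking a dated snapshot before renewing CA certificates makes the rollback path trivial if the renewal breaks agent/server trust.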


=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]

=== 2019-12-12 ===
* 23:36 bstorm_: rebooting toolschecker after downtiming the services
* 22:58 bstorm_: rebooting tools-acme-chief-01
* 22:53 bstorm_: rebooting the cron server, tools-sgecron-01 as it wasn't recovered from last night's maintenance
* 11:20 arturo: rolling reboot for all grid & k8s worker nodes due to NFS staleness
* 09:22 arturo: reboot tools-sgeexec-0911 to try fixing weird NFS state
* 08:46 arturo: doing `run-puppet-agent` in all VMs to see state of NFS
* 08:34 arturo: reboot tools-worker-1033/1034 and tools-sgebastion-08 to try to correct NFS mount issues
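The NFS-staleness reboots above hinge on spotting stale mounts in the first place. A stale NFS handle typically makes any access hang, so a common detection trick is a `stat` wrapped in `timeout`; a minimal sketch (the mount point to check is whatever the host actually mounts, e.g. /data/project):

```shell
# Sketch: a healthy mount answers a stat almost instantly; a stale NFS handle
# hangs until the timeout fires, which we report as STALE.
check_mount() {
  if timeout 5 stat -t "$1" >/dev/null 2>&1; then
    echo "$1 ok"
  else
    echo "$1 STALE"
  fi
}

check_mount /tmp
```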


=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2019-12-11 ===
* 18:13 bd808: Restarted maintain-dbusers on labstore1004. Process had not logged any account creations since 2019-12-01T22:45:45.
* 17:24 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1031


=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2019-12-10 ===
* 13:59 arturo: set pod replicas to 3 in the new k8s cluster ([[phab:T239405|T239405]])


=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]

=== 2019-12-09 ===
* 11:06 andrewbogott: deleting unused security groups: catgraph, devpi, MTA, mysql, syslog, test [[phab:T91619|T91619]]


=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2019-12-04 ===
* 13:45 arturo: drop puppet prefix `tools-cron`, deprecated and no longer in use


=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

=== 2019-11-29 ===
* 11:45 arturo: created 3 new VMs `tools-k8s-worker-[3,4,5]` ([[phab:T239403|T239403]])
* 10:28 arturo: re-arm keyholder in tools-acme-chief-01 (password in labs/private.git @ tools-puppetmaster-01)
* 10:27 arturo: re-arm keyholder in tools-acme-chief-02 (password in labs/private.git @ tools-puppetmaster-01)


=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation

=== 2019-11-26 ===
* 23:25 bstorm_: rebuilding docker images to include the new webservice 0.52 in all versions instead of just the stretch ones [[phab:T236202|T236202]]
* 22:57 bstorm_: push upgraded webservice 0.52 to the buster and jessie repos for container rebuilds [[phab:T236202|T236202]]
* 19:55 phamhi: drained tools-worker-1002,8,15,32 to rebalance the cluster
* 19:45 phamhi: cleaned up container that was taking up 16G of disk space on tools-worker-1020 in order to re-run puppet client
* 14:01 arturo: drop hiera references to `tools-test-proxy-01.tools.eqiad.wmflabs`. Such VM no longer exists
* 14:00 arturo: introduce the `profile::toolforge::proxies` hiera key in the global puppet config


=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]

=== 2019-11-25 ===
* 10:35 arturo: refresh puppet certs for tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])
* 10:35 arturo: add puppet cert SANs via instance hiera to tools-k8s-etcd-[4-6] nodes ([[phab:T238655|T238655]])


=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko

=== 2019-11-22 ===
* 13:32 arturo: created security group `tools-new-k8s-full-connectivity` and add new k8s VMs to it ([[phab:T238654|T238654]])
* 05:55 jeh: add Riley Huntley `riley` to base tools project


=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko

=== 2019-11-21 ===
* 12:48 arturo: reboot the new k8s cluster after the upgrade
* 11:49 arturo: upgrading new k8s kubectl version to 1.15.6 ([[phab:T238654|T238654]])
* 11:44 arturo: upgrading new k8s kubelet version to 1.15.6 ([[phab:T238654|T238654]])
* 10:29 arturo: upgrading new k8s cluster version to 1.15.6 using kubeadm ([[phab:T238654|T238654]])
* 10:28 arturo: install kubeadm 1.15.6 on worker/control nodes in the new k8s cluster ([[phab:T238654|T238654]])


=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko

=== 2019-11-19 ===
* 13:49 arturo: re-create nginx-ingress pod due to deployment template refresh ([[phab:T237643|T237643]])
* 12:46 arturo: deploy changes to tools-prometheus to account for the new k8s cluster ([[phab:T237643|T237643]])


=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

=== 2019-11-15 ===
* 14:44 arturo: stop live-hacks on tools-prometheus-01 [[phab:T237643|T237643]]


=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko

=== 2019-11-13 ===
* 17:20 arturo: live-hacking tools-prometheus-01 to test some experimental configs for the new k8s cluster ([[phab:T237643|T237643]])


=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]

=== 2019-11-12 ===
* 12:52 arturo: reboot tools-proxy-06 to reset iptables setup [[phab:T238058|T238058]]


=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])

=== 2019-11-10 ===
* 02:17 bd808: Building new Docker images for [[phab:T237836|T237836]] (retrying after cleaning out old images on tools-docker-builder-06)
* 02:15 bd808: Cleaned up old images on tools-docker-builder-06 using instructions from https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Kubernetes#Building_toolforge_specific_images
* 02:10 bd808: Building new Docker images for [[phab:T237836|T237836]]
* 01:45 bstorm_: deploying bugfix for webservice in tools and toolsbeta [[phab:T237836|T237836]]


=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]

=== 2019-11-08 ===
* 22:47 bstorm_: adding rsync::server::wrap_with_stunnel: false to the tools-docker-registry-03/4 servers to unbreak puppet
* 18:40 bstorm_: pushed new webservice package to the bastions [[phab:T230961|T230961]]
* 18:37 bstorm_: pushed new webservice package supporting buster containers to repo [[phab:T230961|T230961]]
* 18:36 bstorm_: pushed buster-sssd images to the docker repo
* 17:15 phamhi: pushed new buster images with the prefix name "toolforge"


=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])

=== 2019-11-07 ===
* 13:27 arturo: deployed registry-admission-webhook and ingress-admission-controller into the new k8s cluster ([[phab:T236826|T236826]])
* 13:01 arturo: creating puppet prefix `tools-k8s-worker` and a couple of VMs `tools-k8s-worker-[1,2]` [[phab:T236826|T236826]]
* 12:57 arturo: increasing project quota [[phab:T237633|T237633]]
* 11:54 arturo: point `k8s.tools.eqiad1.wikimedia.cloud` to tools-k8s-haproxy-1 [[phab:T236826|T236826]]
* 11:43 arturo: create VMs `tools-k8s-haproxy-[1,2]` [[phab:T236826|T236826]]
* 11:43 arturo: create puppet prefix `tools-k8s-haproxy` [[phab:T236826|T236826]]


=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]

=== 2019-11-06 ===
* 22:32 bstorm_: added rsync::server::wrap_with_stunnel: false to tools-sge-services prefix to fix puppet
* 21:33 bstorm_: docker images needed for kubernetes cluster upgrade deployed [[phab:T215531|T215531]]
* 20:26 bstorm_: building and pushing docker images needed for kubernetes cluster upgrade
* 16:10 arturo: new k8s cluster control nodes are bootstrapped ([[phab:T236826|T236826]])
* 13:51 arturo: created FQDN `k8s.tools.eqiad1.wikimedia.cloud` pointing to `tools-k8s-control-1` for the initial bootstrap ([[phab:T236826|T236826]])
* 13:50 arturo: created 3 VMs `tools-k8s-control-[1,2,3]` ([[phab:T236826|T236826]])
* 13:43 arturo: created `tools-k8s-control` puppet prefix [[phab:T236826|T236826]]
* 11:57 phamhi: restarted all webservices in grid ([[phab:T233347|T233347]])


=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82

=== 2019-11-05 ===
* 23:08 Krenair: Dropped {{Gerrit|59a77a3}}, {{Gerrit|3830802}}, and {{Gerrit|83df61f}} from tools-puppetmaster-01:/var/lib/git/labs/private cherry-picks as these are no longer required [[phab:T206235|T206235]]
* 22:49 Krenair: Disassociated floating IP 185.15.56.60 from tools-static-13, traffic to this host goes via the project-proxy now. DNS was already changed a few days ago. [[phab:T236952|T236952]]
* 22:35 bstorm_: upgraded libpython3.4 libpython3.4-dbg libpython3.4-minimal libpython3.4-stdlib python3.4 python3.4-dbg python3.4-minimal to fix an old broken patch [[phab:T237468|T237468]]
* 22:12 bstorm_: pushed docker-registry.tools.wmflabs.org/maintain-kubeusers:beta to the registry to deploy in toolsbeta
* 17:38 phamhi: restarted lighttpd based webservice pods on tools-worker-103x and 1040 ([[phab:T233347|T233347]])
* 17:34 phamhi: restarted lighttpd based webservice pods on tools-worker-102[0-9] ([[phab:T233347|T233347]])
* 17:06 phamhi: restarted lighttpd based webservice pods on tools-worker-101[0-9] ([[phab:T233347|T233347]])
* 16:44 phamhi: restarted lighttpd based webservice pods on tools-worker-100[1-9] ([[phab:T233347|T233347]])
* 13:55 arturo: created 3 new VMs: `tools-k8s-etcd-[4,5,6]` [[phab:T236826|T236826]]


=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])

=== 2019-11-04 ===
* 14:45 phamhi: Built and pushed ruby25 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed golang111 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed jdk11 docker image based on buster ([[phab:T230961|T230961]])
* 14:45 phamhi: Built and pushed php73 docker image based on buster ([[phab:T230961|T230961]])
* 11:10 phamhi: Built and pushed python37 docker image based on buster ([[phab:T230961|T230961]])

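The image rebuild entries above all follow the same build-tag-push cycle against the Toolforge-internal registry. A minimal sketch of that cycle, which only prints the commands rather than running them; the registry hostname appears in the log, but the image name is an illustrative example, not one of the actual images:

```shell
#!/bin/sh
# Sketch: build an image and push it to the Toolforge-internal registry.
# The commands are echoed, not executed; the image name is hypothetical.
REG="docker-registry.tools.wmflabs.org"
IMG="toolforge-python37-sssd-base"
echo "docker build -t $REG/$IMG:latest ."
echo "docker push $REG/$IMG:latest"
```

In practice these builds were driven by tooling (and later by cookbooks, as the wm-bot entries show) rather than run by hand.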

=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko

=== 2019-11-01 ===
* 21:00 Krenair: Removed tools-checker.wmflabs.org A record to 208.80.155.229 as that target IP is in the old pre-neutron range that is no longer routed
* 20:57 Krenair: Removed trusty.tools.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:56 Krenair: Removed tools-trusty.wmflabs.org CNAME to login-trusty.tools.wmflabs.org as that target record does not exist, presumably deleted ages ago
* 20:38 Krenair: Updated A record for tools-static.wmflabs.org to point towards project-proxy [[phab:T236952|T236952]]


=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko

=== 2019-10-31 ===
* 18:47 andrewbogott: deleted and/or truncated a bunch of logfiles on tools-worker-1001.  Runaway logfiles filled up the drive which prevented puppet from running.  If puppet had run, it would have prevented the runaway logfiles.
* 13:59 arturo: update puppet prefix `tools-k8s-etcd-` to use the `role::wmcs::toolforge::k8s::etcd` [[phab:T236826|T236826]]
* 13:41 arturo: disabling puppet in tools-k8s-etcd- nodes to test https://gerrit.wikimedia.org/r/c/operations/puppet/+/546995
* 10:15 arturo: SSL cert replacement for tools-docker-registry and tools-k8s-master went fine apparently ([[phab:T236962|T236962]])
* 10:02 arturo: icinga downtime toolschecker for 1h for replacing SSL certs in tools-docker-registry and tools-k8s-master ([[phab:T236962|T236962]])


=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])

=== 2019-10-30 ===
* 13:53 arturo: replacing SSL cert in tools-proxy-x server apparently OK (merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679) [[phab:T235252|T235252]]
* 13:48 arturo: replacing SSL cert in tools-proxy-x server (live-hacking https://gerrit.wikimedia.org/r/c/operations/puppet/+/545679 first for testing) [[phab:T235252|T235252]]
* 13:40 arturo: icinga downtime toolschecker for 1h for replacing SSL cert [[phab:T235252|T235252]]


=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)

=== 2019-10-29 ===
* 10:49 arturo: deleting VM tools-test-proxy-01, no longer in use
* 10:07 arturo: deleting old jessie VMs tools-proxy-03/04 [[phab:T235627|T235627]]


=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /
* 15:26 bd808: Killed all processes owned by jem on tools-sgebastion-08
* 15:11 bd808: Killed ircbot.php processes started by jem on tools-sgebastion-08 per request on irc

=== 2019-10-28 ===
* 16:06 arturo: delete VM instance `tools-test-proxy-01` and the puppet prefix `tools-test-proxy`
* 15:54 arturo: tools-proxy-05 has now the 185.15.56.11 floating IP as active proxy. Old one 185.15.56.6 has been freed [[phab:T235627|T235627]]
* 15:54 arturo: shutting down tools-proxy-03 [[phab:T235627|T235627]]
* 15:16 arturo: tools-proxy-05 has now the 185.15.56.5 floating IP as active proxy [[phab:T235627|T235627]]
* 15:14 arturo: refresh hiera to use tools-proxy-05 as active proxy [[phab:T235627|T235627]]
* 14:58 arturo: added `webproxy` security group to tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:57 phamhi: drained tools-worker-1031.tools.eqiad.wmflabs to clean up disk space
* 14:45 arturo: created VMs tools-proxy-05 and tools-proxy-06 ([[phab:T235627|T235627]])
* 14:43 arturo: adding `role::wmcs::toolforge::proxy` to the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:42 arturo: deleted `role::toollabs::proxy` from the `tools-proxy` puppet prefix ([[phab:T235627|T235627]])
* 14:34 arturo: icinga downtime toolschecker for 1h ([[phab:T235627|T235627]])
* 12:25 arturo: upload image `coredns` v1.3.1 ({{Gerrit|eb516548c180}}) to docker registry ([[phab:T236249|T236249]])
* 12:23 arturo: upload image `kube-apiserver` v1.15.1 ({{Gerrit|68c3eb07bfc3}}) to docker registry ([[phab:T236249|T236249]])
* 12:22 arturo: upload image `kube-controller-manager` v1.15.1 ({{Gerrit|d75082f1d121}}) to docker registry ([[phab:T236249|T236249]])
* 12:20 arturo: upload image `kube-proxy` v1.15.1 ({{Gerrit|89a062da739d}}) to docker registry ([[phab:T236249|T236249]])
* 12:19 arturo: upload image `kube-scheduler` v1.15.1 ({{Gerrit|b0b3c4c404da}}) to docker registry ([[phab:T236249|T236249]])
* 12:04 arturo: upload image `calico/node` v3.8.0 ({{Gerrit|cd3efa20ff37}}) to docker registry ([[phab:T236249|T236249]])
* 12:03 arturo: upload image `calico/pod2daemon-flexvol` v3.8.0 ({{Gerrit|f68c8f870a03}}) to docker registry ([[phab:T236249|T236249]])
* 12:01 arturo: upload image `calico/cni` v3.8.0 ({{Gerrit|539ca36a4c13}}) to docker registry ([[phab:T236249|T236249]])
* 11:58 arturo: upload image `calico/kube-controllers` v3.8.0 ({{Gerrit|df5ff96cd966}}) to docker registry ([[phab:T236249|T236249]])
* 11:47 arturo: upload image `nginx-ingress-controller` v0.25.1 ({{Gerrit|0439eb3e11f1}}) to docker registry ([[phab:T236249|T236249]])


=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

=== 2019-10-24 ===
* 16:32 bstorm_: set the prod rsyslog config for kubernetes to false for Toolforge


=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7

=== 2019-10-23 ===
* 20:00 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.47 ([[phab:T233347|T233347]])
* 12:09 phamhi: Deployed toollabs-webservice 0.47 to buster-tools and stretch-tools ([[phab:T233347|T233347]])
* 09:13 arturo: 9 tools-sgeexec nodes and 6 other related VMs are down because hypervisor is rebooting
* 09:03 arturo: tools-sgebastion-08 is down because hypervisor is rebooting


=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions

=== 2019-10-22 ===
* 16:56 bstorm_: drained tools-worker-1025.tools.eqiad.wmflabs which was malfunctioning
* 09:25 arturo: created the `tools.eqiad1.wikimedia.cloud.` DNS zone


=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo

=== 2019-10-21 ===
* 17:32 phamhi: Rebuilding all jessie and stretch docker images to pick up toollabs-webservice 0.46


=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)

=== 2019-10-18 ===
* 22:15 bd808: Rescheduled continuous jobs away from tools-sgeexec-0904 because of high system load
* 22:09 bd808: Cleared error state of webgrid-generic@tools-sgewebgrid-generic-0901, webgrid-lighttpd@tools-sgewebgrid-lighttpd-09{12,15,19,20,26}
* 21:29 bd808: Rescheduled all grid engine webservice jobs ([[phab:T217815|T217815]])

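"Cleared error state" entries like the ones above recur throughout this log. On Son of Grid Engine, a queue instance that enters the (E)rror state stays unusable until an admin clears it. The recurring procedure can be sketched roughly as follows; the commands are only printed, and the queue instance name is taken from the log as an example:

```shell
#!/bin/sh
# Sketch: find queue instances in error state and emit the command that
# clears them. On a real grid master the list would come from something
# like `qstat -f -explain E`; here we hardcode one example from the log.
list_error_queues() {
  echo "continuous@tools-sgeexec-0939.tools.eqiad.wmflabs"
}
for q in $(list_error_queues); do
  # qmod -cq clears the error state on a queue instance
  echo "sudo qmod -cq $q"
done
```

The 2022-03-28 wm-bot entry shows this being folded into a cookbook rather than run by hand.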

=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo ([[phab:T297090|T297090]])

=== 2019-10-16 ===
* 16:21 phamhi: Deployed toollabs-webservice 0.46 to buster-tools and stretch-tools ([[phab:T218461|T218461]])
* 09:29 arturo: toolforge is recovered from the reboot of cloudvirt1029
* 09:17 arturo: due to the reboot of cloudvirt1029, several sgeexec nodes (8) are offline, also sgewebgrid-lighttpd (8) and tools-worker (3) and the main toolforge proxy (tools-proxy-03)


=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

=== 2019-10-15 ===
* 17:10 phamhi: restart tools-worker-1035 because it is no longer responding


=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

=== 2019-10-14 ===
* 09:26 arturo: cleaned-up updatetools from tools-sge-services nodes ([[phab:T229261|T229261]])


=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

=== 2019-10-11 ===
* 19:52 bstorm_: restarted docker on tools-docker-builder after phamhi noticed the daemon had a routing issue (blank iptables)
* 11:55 arturo: create tools-test-proxy-01 VM for testing [[phab:T235059|T235059]] and a puppet prefix for it
* 10:53 arturo: added kubernetes-node_1.4.6-7_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:51 arturo: added docker-engine_1.12.6-0~debian-jessie_amd64.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]
* 10:46 arturo: added logster_0.0.10-2~jessie1_all.deb to buster-tools and buster-toolsbeta (aptly) for [[phab:T235059|T235059]]


=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

=== 2019-10-10 ===
* 02:33 bd808: Rebooting tools-sgewebgrid-lighttpd-0903. Instance hung.


=== 2022-02-16 ===
* 00:12 bd808: Image builds completed.

=== 2019-10-09 ===
* 22:52 jeh: removing test instances tools-sssd-sgeexec-test-[12] from SGE
* 15:32 phamhi: drained tools-worker-1020/23/33/35/36/40 to rebalance the cluster
* 14:46 phamhi: drained and cordoned tools-worker-1029 after status reset on reboot
* 12:37 arturo: drain tools-worker-1038 to rebalance load in the k8s cluster
* 12:35 arturo: uncordon tools-worker-1029 (was disabled for unknown reasons)
* 12:33 arturo: drain tools-worker-1010 to rebalance load
* 10:33 arturo: several sgewebgrid-lighttpd nodes (9) not available because cloudvirt1013 is rebooting
* 10:21 arturo: several worker nodes (7) not available because cloudvirt1012 is rebooting
* 10:08 arturo: several worker nodes (6) not available because cloudvirt1009 is rebooting
* 09:59 arturo: several worker nodes (5) not available because cloudvirt1008 is rebooting


=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]

=== 2019-10-08 ===
* 19:40 bstorm_: drained tools-worker-1007/8 to rebalance the cluster
* 19:34 bstorm_: drained tools-worker-1009 and then 1014 for rebalancing
* 19:27 bstorm_: drained tools-worker-1005 for rebalancing (and put these back in service as I went)
* 19:24 bstorm_: drained tools-worker-1003 and 1009 for rebalancing
* 15:41 arturo: deleted VM instance tools-sgebastion-0test. No longer in use.


=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]

=== 2019-10-07 ===
* 20:17 bd808: Dropped backlog of messages for delivery to tools.usrd-tools
* 20:16 bd808: Dropped backlog of messages for delivery to tools.mix-n-match
* 20:13 bd808: Dropped backlog of frozen messages for delivery (240 dropped)
* 19:25 bstorm_: deleted tools-puppetmaster-02
* 19:20 Krenair: reboot tools-k8s-master-01 due to nfs stale issue
* 19:18 Krenair: reboot tools-paws-worker-1006 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1040 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1039 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1038 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1037 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1036 due to nfs stale issue
* 19:16 phamhi: reboot tools-worker-1035 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1034 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1033 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1032 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1031 due to nfs stale issue
* 19:15 phamhi: reboot tools-worker-1030 due to nfs stale issue
* 19:10 Krenair: reboot tools-puppetmaster-02 due to nfs stale issue
* 19:09 Krenair: reboot tools-sgebastion-0test due to nfs stale issue
* 19:08 Krenair: reboot tools-sgebastion-09 due to nfs stale issue
* 19:08 Krenair: reboot tools-sge-services-04 due to nfs stale issue
* 19:07 Krenair: reboot tools-paws-worker-1002 due to nfs stale issue
* 19:06 Krenair: reboot tools-mail-02 due to nfs stale issue
* 19:06 Krenair: reboot tools-docker-registry-03 due to nfs stale issue
* 19:04 Krenair: reboot tools-worker-1029 due to nfs stale issue
* 19:00 Krenair: reboot tools-static-12 tools-docker-registry-04 and tools-clushmaster-02 due to NFS stale issue
* 18:55 phamhi: reboot tools-worker-1028 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1027 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1026 due to nfs stale issue
* 18:55 phamhi: reboot tools-worker-1025 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1023 due to nfs stale issue
* 18:47 phamhi: reboot tools-worker-1022 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1021 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1020 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1019 due to nfs stale issue
* 18:46 phamhi: reboot tools-worker-1018 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1017 due to nfs stale issue
* 18:34 phamhi: reboot tools-worker-1016 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1015 due to nfs stale issue
* 18:32 phamhi: reboot tools-worker-1014 due to nfs stale issue
* 18:23 phamhi: reboot tools-worker-1013 due to nfs stale issue
* 18:21 phamhi: reboot tools-worker-1012 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1011 due to nfs stale issue
* 18:12 phamhi: reboot tools-worker-1010 due to nfs stale issue
* 18:08 phamhi: reboot tools-worker-1009 due to nfs stale issue
* 18:07 phamhi: reboot tools-worker-1008 due to nfs stale issue
* 17:58 phamhi: reboot tools-worker-1007 due to nfs stale issue
* 17:57 phamhi: reboot tools-worker-1006 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1005 due to nfs stale issue
* 17:47 phamhi: reboot tools-worker-1004 due to nfs stale issue
* 17:43 phamhi: reboot tools-worker-1002.tools.eqiad.wmflabs due to nfs stale issue
* 17:35 phamhi: drained and uncordoned tools-worker-100[1-5]
* 17:32 bstorm_: reboot tools-sgewebgrid-lighttpd-0912
* 17:30 bstorm_: reboot tools-sgewebgrid-lighttpd-0923/24/08
* 17:01 bstorm_: rebooting tools-sgegrid-master and tools-sgegrid-shadow 😭
* 16:58 bstorm_: rebooting tools-sgewebgrid-lighttpd-0902/4/6/7/8/19
* 16:53 bstorm_: rebooting tools-sgewebgrid-generic-0902/4
* 16:50 bstorm_: rebooting tools-sgeexec-0915/18/19/23/26
* 16:49 bstorm_: rebooting tools-sgeexec-0901 and tools-sgeexec-0909/10/11
* 16:46 bd808: `sudo shutdown -r now` for tools-sgebastion-08
* 16:41 bstorm_: reboot tools-sgebastion-07
* 16:39 bd808: `sudo service nslcd restart` on tools-sgebastion-08

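The long run of per-host "reboot … due to nfs stale issue" entries above is the kind of repetitive sequence admins usually generate with a loop rather than type out. A minimal sketch that emits one reboot command per worker in a range; the range, and the use of the OpenStack CLI as the reboot mechanism, are illustrative assumptions (rebooting over ssh would work equally well):

```shell
#!/bin/sh
# Sketch: emit a reboot command for each tools-worker node in a range.
# Range and command form are hypothetical examples, not the exact
# procedure used in the incident above.
for i in $(seq 1001 1005); do
  echo "openstack server reboot tools-worker-$i"
done
```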

=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]

=== 2019-10-04 ===
* 21:43 bd808: `sudo exec-manage repool tools-sgeexec-0923.tools.eqiad.wmflabs`
* 21:26 bd808: Rebooting tools-sgeexec-0923 after lots of messing about with a broken update-initramfs build
* 20:35 bd808: Manually running `/usr/bin/python3 /usr/bin/unattended-upgrade` on tools-sgeexec-0923
* 20:33 bd808: Killed 2 /usr/bin/unattended-upgrade procs on tools-sgeexec-0923 that seemed stuck
* 13:33 arturo: remove /etc/init.d/rsyslog on tools-worker-XXXX nodes so the rsyslog deb prerm script doesn't prevent the package from being updated


=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]

=== 2019-10-03 ===
* 13:05 arturo: delete servers tools-sssd-sgeexec-test-[1,2], no longer required


=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes

=== 2019-09-27 ===
* 16:59 bd808: Set "profile::rsyslog::kafka_shipper::kafka_brokers: []" in tools-elastic prefix puppet
* 00:40 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0927


=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

=== 2019-09-25 ===
* 19:08 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1021


=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]

=== 2019-09-23 ===
* 16:58 bstorm_: deployed tools-manifest 0.20 and restarted webservicemonitor
* 06:01 bd808: Restarted maintain-dbusers process on labstore1004. ([[phab:T233530|T233530]])


=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])

=== 2019-09-12 ===
* 20:48 phamhi: Deleted tools-puppetdb-01.tools as it is no longer in use


=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

=== 2019-09-11 ===
* 13:30 jeh: restart tools-sgeexec-0912


=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2019-09-09 ===
* 22:44 bstorm_: uncordoned tools-worker-1030 and tools-worker-1038


=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2019-09-06 ===
* 15:11 bd808: `sudo kill -9 10635` on tools-k8s-master-01 ([[phab:T194859|T194859]])


=== 2022-01-19 ===
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

=== 2019-09-05 ===
* 21:02 bd808: Enabled Puppet on tools-docker-registry-03 and forced puppet run ([[phab:T232135|T232135]])
* 18:13 bd808: Disabled Puppet on tools-docker-registry-03 to investigate docker-registry issue (no phab task yet)


=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]

=== 2019-09-01 ===
* 20:51 Reedy: `sudo service maintain-kubeusers restart` on tools-k8s-master-01


=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

=== 2019-08-30 ===
* 16:54 phamhi: restart maintain-kubeusers service in tools-k8s-master-01
* 16:21 bstorm_: depooling tools-sgewebgrid-lighttpd-0923 to reboot -- high iowait likely from NFS mounts


=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]

=== 2019-08-29 ===
* 22:18 bd808: Finished building new stretch Docker images for Toolforge Kubernetes use
* 22:06 bd808: Starting process of building new stretch Docker images for Toolforge Kubernetes use
* 22:05 bd808: Jessie Docker image rebuild complete
* 21:31 bd808: Starting process of building new jessie Docker images for Toolforge Kubernetes use
 
=== 2019-08-27 ===
* 19:10 bd808: Restarted maintain-kubeusers after complaint on irc. It was stuck in limbo again
 
=== 2019-08-26 ===
* 21:48 bstorm_: repooled tools-sgewebgrid-generic-0902, tools-sgewebgrid-lighttpd-0902, tools-sgewebgrid-lighttpd-0903 and tools-sgeexec-0905
 
=== 2019-08-18 ===
* 08:11 arturo: restart maintain-kubeusers service in tools-k8s-master-01
 
=== 2019-08-17 ===
* 10:56 arturo: force-reboot tools-worker-1006. It is completely stuck
 
=== 2019-08-15 ===
* 15:32 jeh: upgraded jobutils debian package to 1.38 [[phab:T229551|T229551]]
* 09:22 arturo: restart maintain-kubeusers service in tools-k8s-master-01 because some tools were missing their namespaces
 
=== 2019-08-13 ===
* 22:00 bstorm_: truncated exim paniclog on tools-sgecron-01 because it was being spammy
* 13:41 jeh: Set icinga downtime for toolschecker labs showmount [[phab:T229448|T229448]]
 
=== 2019-08-12 ===
* 16:08 phamhi: updated prometheus-node-exporter from 0.14.0~git20170523-1 to 0.17.0+ds-3 in tools-worker-[1030-1040] nodes ([[phab:T230147|T230147]])
 
=== 2019-08-08 ===
* 19:26 jeh: restarting tools-sgewebgrid-lighttpd-0915 [[phab:T230157|T230157]]
 
=== 2019-08-07 ===
* 19:07 bd808: Disassociated SUL and Phabricator accounts from user Lophi ([[phab:T229713|T229713]])
 
=== 2019-08-06 ===
* 16:18 arturo: add phamhi as user/projectadmin ([[phab:T228942|T228942]]) and delete hpham
* 15:59 arturo: add hpham as user/projectadmin ([[phab:T228942|T228942]])
* 13:44 jeh: disabling puppet on tools-checker-03 while testing nginx timeouts [[phab:T221301|T221301]]
 
=== 2019-08-05 ===
* 22:49 bstorm_: launching tools-worker-1040
* 20:36 andrewbogott: rebooting oom tools-worker-1026
* 16:10 jeh: `tools-k8s-master-01: systemctl restart maintain-kubeusers` [[phab:T229846|T229846]]
* 09:39 arturo: `root@tools-checker-03:~# toolscheckerctl restart` again ([[phab:T229787|T229787]])
* 09:30 arturo: `root@tools-checker-03:~# toolscheckerctl restart` ([[phab:T229787|T229787]])
 
=== 2019-08-02 ===
* 14:00 andrewbogott_: rebooting tools-worker-1022 as it is unresponsive
 
=== 2019-07-31 ===
* 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
* 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
* 17:32 bstorm_: drained tools-worker-1028 to rebalance load
* 17:29 bstorm_: drained tools-worker-1008 to rebalance load
* 17:23 bstorm_: drained tools-worker-1021 to rebalance load
* 17:17 bstorm_: drained tools-worker-1007 to rebalance load
* 17:07 bstorm_: drained tools-worker-1004 to rebalance load
* 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
* 15:33 bstorm_: [[phab:T228573|T228573]] spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)
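The drain entries above all repeat one rebalancing cycle: cordon and drain a worker so its pods reschedule elsewhere, then return it to service. A minimal sketch that only prints the commands; the node name is an illustrative example, and the `--delete-local-data` spelling matches the kubectl versions of this era:

```shell
#!/bin/sh
# Sketch: print the drain/uncordon cycle used to rebalance a k8s worker.
# Commands are echoed rather than executed; the node name is an example.
NODE="tools-worker-1015.tools.eqiad.wmflabs"
echo "kubectl drain --ignore-daemonsets --delete-local-data $NODE"
# ...wait for pods to reschedule and verify node is empty...
echo "kubectl uncordon $NODE"
```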
 
=== 2019-07-27 ===
* 23:00 zhuyifei1999_: a past probably related ticket: [[phab:T194859|T194859]]
* 22:57 zhuyifei1999_: maintain-kubeusers seems stuck. Traceback: https://phabricator.wikimedia.org/P8812, core dump: /root/core.17898. Restarting
 
=== 2019-07-26 ===
* 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
* 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
* 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
* 16:32 bstorm_: created tools-worker-1034 - [[phab:T228573|T228573]]
* 15:57 bstorm_: created tools-worker-1032 and 1033 - [[phab:T228573|T228573]]
* 15:55 bstorm_: created tools-worker-1031 - [[phab:T228573|T228573]]
 
=== 2019-07-25 ===
* 22:01 bstorm_: [[phab:T228573|T228573]] created tools-worker-1030
* 21:22 jeh: rebooting tools-worker-1016 unresponsive
 
=== 2019-07-24 ===
* 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
* 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 ([[phab:T227539|T227539]])
 
=== 2019-07-22 ===
* 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
* 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
* 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
* 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
* 17:55 bstorm_: draining tools-worker-1023 since it is having issues
* 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats [[phab:T228573|T228573]]
 
=== 2019-07-20 ===
* 19:52 andrewbogott: rebooting tools-worker-1023
 
=== 2019-07-17 ===
* 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014
 
=== 2019-07-15 ===
* 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job {{Gerrit|5190035}}
 
=== 2019-06-25 ===
* 09:30 arturo: detected puppet issue in all VMs: [[phab:T226480|T226480]]
 
=== 2019-06-24 ===
* 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015
 
=== 2019-06-17 ===
* 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
* 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: [[phab:T220853|T220853]] )
 
=== 2019-06-11 ===
* 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs
 
=== 2019-06-05 ===
* 18:33 andrewbogott: repooled  tools-sgeexec-0921 and tools-sgeexec-0929
* 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929
 
=== 2019-05-30 ===
* 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
* 13:01 arturo: reboot tools-worker-1003 to clean up sssd config and let nslcd/nscd start fresh
* 12:47 arturo: reboot tools-worker-1002 to clean up sssd config and let nslcd/nscd start fresh
* 12:42 arturo: reboot tools-worker-1001 to clean up sssd config and let nslcd/nscd start fresh
* 12:35 arturo: enable puppet in tools-worker nodes
* 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because [[phab:T224651|T224651]] ([[phab:T224558|T224558]])
* 12:25 arturo: cordon/drain tools-worker-1002 because [[phab:T224651|T224651]]
* 12:23 arturo: cordon/drain tools-worker-1001 because [[phab:T224651|T224651]]
* 12:22 arturo: cordon/drain tools-worker-1029 because [[phab:T224651|T224651]]
* 12:20 arturo: cordon/drain tools-worker-1003 because [[phab:T224651|T224651]]
* 11:59 arturo: [[phab:T224558|T224558]] repool tools-worker-1003 (using sssd/sudo now!)
* 11:23 arturo: [[phab:T224558|T224558]] depool tools-worker-1003
* 10:48 arturo: [[phab:T224558|T224558]] drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
* 10:33 arturo: [[phab:T224558|T224558]] switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:28 arturo: [[phab:T224558|T224558]] use hiera config in prefix tools-worker for sssd/sudo
* 10:27 arturo: [[phab:T224558|T224558]] switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
* 10:09 arturo: [[phab:T224558|T224558]] disable puppet in all tools-worker- nodes
* 10:01 arturo: [[phab:T224558|T224558]] add tools-worker-1029 to the nodes pool of k8s
* 09:58 arturo: [[phab:T224558|T224558]] reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie
 
=== 2019-05-29 ===
* 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
* 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes ([[phab:T221225|T221225]])
* 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
* 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning
 
=== 2019-05-28 ===
* 18:15 arturo: [[phab:T221225|T221225]] for the record, tools-worker-1001 is not working after trying with sssd
* 18:13 arturo: [[phab:T221225|T221225]] created tools-worker-1029 to test sssd/sudo stuff
* 17:49 arturo: [[phab:T221225|T221225]] repool tools-worker-1002 (using nscd/nslcd and sudoldap)
* 17:44 arturo: [[phab:T221225|T221225]] back to classic/ldap hiera config in the tools-worker puppet prefix
* 17:35 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001 again
* 17:27 arturo: [[phab:T221225|T221225]] hard reboot tools-worker-1001
* 17:12 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1002
* 17:09 arturo: [[phab:T221225|T221225]] depool & switch to sssd/sudo & reboot & repool tools-worker-1001
* 17:08 arturo: [[phab:T221225|T221225]] switch to sssd/sudo in puppet prefix for tools-worker
* 13:04 arturo: [[phab:T221225|T221225]] depool and rebooted tools-worker-1001 in preparation for sssd migration
* 12:39 arturo: [[phab:T221225|T221225]] disable puppet in all tools-worker nodes in preparation for sssd
* 12:32 arturo: drop the tools-bastion puppet prefix, unused
* 12:31 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
* 12:27 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
* 12:16 arturo: [[phab:T221225|T221225]] set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
* 11:26 arturo: merged change to the sudo module to allow sssd transition
 
=== 2019-05-27 ===
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
* 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%
 
=== 2019-05-21 ===
* 12:35 arturo: [[phab:T223992|T223992]] rebooting tools-redis-1002
 
=== 2019-05-20 ===
* 11:25 arturo: [[phab:T223332|T223332]] enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
* 10:53 arturo: [[phab:T223332|T223332]] disable puppet agent in tools-k8s-master and tools-docker-registry nodes
 
=== 2019-05-18 ===
* 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image ([[phab:T217908|T217908]])
* 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45
 
=== 2019-05-17 ===
* 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
* 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)
 
=== 2019-05-16 ===
* 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
* 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time
 
=== 2019-05-15 ===
* 16:20 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-0921 and -0929
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0921 and move to cloudvirt1014
* 15:32 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and move to cloudvirt1014
* 12:29 arturo: [[phab:T223148|T223148]] repool both tools-sgeexec-09[37,39]
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0937 and move to cloudvirt1008
* 12:13 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0939 and move to cloudvirt1007
* 11:34 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0940
* 11:20 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0940 and move to cloudvirt1006
* 11:11 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0941
* 10:46 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0941 and move to cloudvirt1005
* 09:44 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0901
* 09:00 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0901 and reallocate to cloudvirt1004
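The depool → reallocate → repool entries above follow one pattern per exec node. A dry-run sketch using plain gridengine commands (the log does not say whether a wrapper script was used; the host name is illustrative and `run` only prints):

```shell
# Dry-run sketch of depooling a grid exec node around a VM migration.
run() { echo "+ $*"; }          # print instead of execute
node=tools-sgeexec-0941         # illustrative

run qmod -d "*@${node}"         # depool: disable every queue instance on the host
# ... cold-migrate the VM to another cloudvirt via Horizon ...
run qmod -e "*@${node}"         # repool: re-enable the queue instances
```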
 
=== 2019-05-14 ===
* 17:12 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0920
* 16:37 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0920 and reallocate to cloudvirt1003
* 16:36 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0911
* 15:56 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0911 and reallocate to cloudvirt1003
* 15:52 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0909
* 15:24 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0909 and reallocate to cloudvirt1002
* 15:24 arturo: [[phab:T223148|T223148]] last SAL entry is bogus, please ignore (depool tools-worker-1009)
* 15:23 arturo: [[phab:T223148|T223148]] depool tools-worker-1009
* 15:13 arturo: [[phab:T223148|T223148]] repool tools-worker-1023
* 13:16 arturo: [[phab:T223148|T223148]] repool tools-sgeexec-0942
* 13:03 arturo: [[phab:T223148|T223148]] repool tools-sgewebgrid-generic-0904
* 12:58 arturo: [[phab:T223148|T223148]] reallocating tools-worker-1023 to cloudvirt1001
* 12:56 arturo: [[phab:T223148|T223148]] depool tools-worker-1023
* 12:52 arturo: [[phab:T223148|T223148]] reallocating tools-sgeexec-0942 to cloudvirt1001
* 12:50 arturo: [[phab:T223148|T223148]] depool tools-sgeexec-0942
* 12:49 arturo: [[phab:T223148|T223148]] reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
* 12:43 arturo: [[phab:T223148|T223148]] depool tools-sgewebgrid-generic-0904
 
=== 2019-05-13 ===
* 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs
 
=== 2019-05-07 ===
* 14:38 arturo: [[phab:T222718|T222718]] uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
* 14:31 arturo: [[phab:T222718|T222718]] reboot tools-worker-1009 and 1022 after being drained
* 14:28 arturo: k8s drain tools-worker-1009 and 1022
* 11:46 arturo: [[phab:T219362|T219362]] enable puppet in tools-redis servers and use the new puppet role
* 11:33 arturo: [[phab:T219362|T219362]] disable puppet in tools-redis servers for puppet code cleanup
* 11:12 arturo: [[phab:T219362|T219362]] drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
* 11:10 arturo: [[phab:T219362|T219362]] enable puppet in tools-static servers and use new puppet role
* 11:01 arturo: [[phab:T219362|T219362]] disable puppet in tools-static servers for puppet code cleanup
* 10:16 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-lighttpd` puppet prefix
* 10:14 arturo: [[phab:T219362|T219362]] drop the `tools-webgrid-generic` puppet prefix
* 10:06 arturo: [[phab:T219362|T219362]] drop the `tools-exec-1` puppet prefix
 
=== 2019-05-06 ===
* 11:34 arturo: [[phab:T221225|T221225]] reenable puppet
* 10:53 arturo: [[phab:T221225|T221225]] disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)
 
=== 2019-05-03 ===
* 09:43 arturo: fixed puppet in tools-puppetdb-01 too
* 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
* 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
* 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
* 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
* 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
* 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package
 
=== 2019-04-30 ===
* 12:50 arturo: enable puppet in all servers [[phab:T221225|T221225]]
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd ([[phab:T221225|T221225]])
* 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
* 11:07 arturo: [[phab:T221225|T221225]] disable puppet in toolforge
* 10:56 arturo: [[phab:T221225|T221225]] create tools-sgebastion-0test for more sssd tests
 
=== 2019-04-29 ===
* 11:22 arturo: [[phab:T221225|T221225]] re-enable puppet agent in all toolforge servers
* 10:27 arturo: [[phab:T221225|T221225]] reboot tool-sgebastion-09 for testing sssd
* 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test [[phab:T221225|T221225]]
* 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages
 
=== 2019-04-26 ===
* 12:20 andrewbogott: rescheduling every pod everywhere
* 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs
 
=== 2019-04-25 ===
* 12:49 arturo: [[phab:T221225|T221225]] using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
* 11:43 arturo: [[phab:T221793|T221793]] removing prometheus crontab and letting puppet agent re-create it again to resolve staleness
 
=== 2019-04-24 ===
* 12:54 arturo: puppet broken, fixing right now
* 09:18 arturo: [[phab:T221225|T221225]] reallocating tools-sgebastion-09 to cloudvirt1008
 
=== 2019-04-23 ===
* 15:26 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-08 to cleanup sssd
* 15:19 arturo: [[phab:T221225|T221225]] creating tools-sgebastion-09 for testing sssd stuff
* 13:06 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
* 12:57 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
* 10:28 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
* 10:27 arturo: [[phab:T221225|T221225]] rebooting tools-sgebastion-07 to clean sssd configuration
* 10:16 arturo: [[phab:T221225|T221225]] disable puppet in tools-sgebastion-08 for sssd testing
* 09:49 arturo: [[phab:T221225|T221225]] run puppet agent in the bastions and reboot them with sssd
* 09:43 arturo: [[phab:T221225|T221225]] use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
* 09:41 arturo: [[phab:T221225|T221225]] disable puppet agent in the bastions
 
=== 2019-04-17 ===
* 12:09 arturo: [[phab:T221225|T221225]] rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
* 11:59 arturo: [[phab:T221205|T221205]] sssd was deployed successfully into all webgrid nodes
* 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
* 11:31 arturo: reboot bastions for sssd deployment
* 11:30 arturo: deploy sssd to bastions
* 11:24 arturo: disable puppet in bastions to deploy sssd
* 09:52 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting the nscd/nslcd packages
* 09:45 arturo: [[phab:T221205|T221205]] tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
* 09:12 arturo: [[phab:T221205|T221205]] start deploying sssd to sgewebgrid nodes
* 09:00 arturo: [[phab:T221205|T221205]] add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
* 08:57 arturo: [[phab:T221205|T221205]] disable puppet in all tools-sgewebgrid-* nodes
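The sssd rollout above is driven by two hiera keys set on the Horizon puppet prefixes; both key names and values appear verbatim in the log entries (the prefix named here is one example from the 09:00 entry):

```yaml
# Horizon prefix hiera, e.g. on the "tools-sgewebgrid-lighttpd" prefix
profile::ldap::client::labs::client_stack: sssd   # 'classic' rolls back to nscd/nslcd
sudo_flavor: sudo                                 # see the 2019-04-30 entries
```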
 
=== 2019-04-16 ===
* 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
* 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
* 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r
 
=== 2019-04-15 ===
* 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
* 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r
 
=== 2019-04-14 ===
* 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them
 
=== 2019-04-13 ===
* 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for [[phab:T220853|T220853]]
* 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for [[phab:T220853|T220853]]
* 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 [[phab:T220853|T220853]]
* 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 [[phab:T220853|T220853]]
 
=== 2019-04-11 ===
* 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
* 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
* 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
* 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
* 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
* 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
* 15:40 andrewbogott: moving tools-redis-1002  to eqiad1-r
* 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
* 12:01 arturo: [[phab:T151704|T151704]] deploying oidentd
* 11:54 arturo: disable puppet in all hosts to deploy oidentd
* 02:33 andrewbogott: tools-paws-worker-1005,  tools-paws-worker-1006 to eqiad1-r
* 00:03 andrewbogott: tools-paws-worker-1002,  tools-paws-worker-1003 to eqiad1-r
 
=== 2019-04-10 ===
* 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
* 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
* 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
* 14:49 bstorm_: cleared E state from 5 queues
* 13:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0906
* 12:31 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0926
* 12:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0925
* 12:06 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0901
* 11:55 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0924
* 11:47 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0921
* 11:23 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0940
* 11:03 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0928
* 10:49 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0923
* 10:43 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0915
* 10:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0935
* 10:19 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0914
* 10:02 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0907
* 09:41 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0918
* 09:27 arturo: [[phab:T218126|T218126]] hard reboot tools-sgeexec-0932
* 09:26 arturo: [[phab:T218216|T218216]] hard reboot tools-sgeexec-0932
* 09:04 arturo: [[phab:T218216|T218216]] add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
* 09:03 arturo: [[phab:T218216|T218216]] do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
* 08:39 arturo: [[phab:T218216|T218216]] disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
* 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r
 
=== 2019-04-09 ===
* 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
* 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
* 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
* 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
* 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
* 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
* 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
* 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
* 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
* 17:05 andrewbogott: migrating  tools-k8s-etcd-01 to eqiad1-r
* 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
* 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
* 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
* 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
* 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get  the k8s node moves to register
 
=== 2019-04-08 ===
* 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
* 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r
 
=== 2019-04-07 ===
* 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
* 01:06 bstorm_: cleared E state from 6 queues
 
=== 2019-04-05 ===
* 15:44 bstorm_: cleared E state from two exec queues
 
=== 2019-04-04 ===
* 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
* 20:53 bd808: Rebooting tools-worker-1013
* 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
* 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
* 20:28 bd808: Shutdown tools-checker-01 via Horizon
* 20:17 bd808: Repooled tools-sgewebgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
* 20:09 bd808: Repooled tools-sgewebgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
* 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
* 20:05 bstorm_: rebooted tools-sgewebgrid-lighttpd-0912
* 20:03 bstorm_: depooled tools-sgewebgrid-lighttpd-0912
* 19:59 bstorm_: depooling and rebooting tools-sgewebgrid-lighttpd-0906
* 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
* 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
* 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
* 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
* 19:13 bstorm_: cleared E state from 7 queues
* 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host
 
=== 2019-04-03 ===
* 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already
 
=== 2019-04-02 ===
* 12:11 arturo: icinga downtime toolschecker for 1 month [[phab:T219243|T219243]]
* 03:55 bd808: Added etcd service group to tools-k8s-etcd-* ([[phab:T219243|T219243]])
 
=== 2019-04-01 ===
* 19:44 bd808: Deleted tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 19:43 bd808: Shutdown tools-checker-02 via Horizon ([[phab:T219243|T219243]])
* 16:53 bstorm_: cleared E state on 6 grid queues
* 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)
 
=== 2019-03-29 ===
* 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
* 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 ([[phab:T219243|T219243]])
* 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
* 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker ([[phab:T219243|T219243]])
* 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing ([[phab:T219243|T219243]])
* 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier ([[phab:T219243|T219243]])
* 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 sudo qmod -cj` on tools-sgegrid-master
* 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
* 17:11 bd808: Restarted nginx on tools-static-13
* 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
* 16:49 bstorm_: cleared E state from 21 queues
* 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
* 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
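The 17:25 entry's one-liner is obscured by wiki `{{!}}` pipe escapes. An equivalent sketch, with a stand-in for `qstat` output so the pipeline is testable (the real run pipes `qstat -u '*'` on the grid master and executes `qmod -cj` per job; the sample column layout is illustrative):

```shell
# Sketch of clearing jobs stuck in Eqw state. sample_qstat stands in for
# `qstat -u '*'`; field 5 here is the job state column.
sample_qstat() {
    printf '%s\n' \
        ' 123 0.5 jobA tool.a Eqw 03/29/2019' \
        ' 456 0.5 jobB tool.b r   03/29/2019'
}
# Emit one "qmod -cj <jobid>" per Eqw job (pipe to `sudo sh` to apply)
sample_qstat | awk '$5 == "Eqw" {print "qmod -cj " $1}'
```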
 
=== 2019-03-28 ===
* 01:00 bstorm_: cleared error states from two queues
* 00:23 bstorm_: [[phab:T216060|T216060]] created tools-sgewebgrid-generic-0901...again!
 
=== 2019-03-27 ===
* 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue [[phab:T219460|T219460]]
* 14:45 bstorm_: cleared several "E" state queues
* 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
* 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
* 12:15 arturo: [[phab:T218126|T218126]] `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)
 
=== 2019-03-26 ===
* 22:00 gtirloni: downtimed toolschecker
* 17:31 arturo: [[phab:T218126|T218126]] create VM instances tools-sssd-sgeexec-test-[12]
* 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
* 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org
 
=== 2019-03-25 ===
* 21:21 bd808: All Trusty grid engine hosts shutdown and deleted ([[phab:T217152|T217152]])
* 21:19 bd808: Deleted tools-grid-{master,shadow} ([[phab:T217152|T217152]])
* 21:18 bd808: Deleted tools-webgrid-lighttpd-14*  ([[phab:T217152|T217152]])
* 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
* 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung /usr/bin/lsof +c 15 -nXd DEL processes
* 20:51 bd808: Deleted tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-143* ([[phab:T217152|T217152]])
* 20:49 bd808: Deleted tools-exec-142* ([[phab:T217152|T217152]])
* 20:48 bd808: Deleted tools-exec-141* ([[phab:T217152|T217152]])
* 20:47 bd808: Deleted tools-exec-140* ([[phab:T217152|T217152]])
* 20:43 bd808: Deleted  tools-cron-01 ([[phab:T217152|T217152]])
* 20:42 bd808: Deleted tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
* 19:59 bd808: Shutdown tools-exec-143* ([[phab:T217152|T217152]])
* 19:51 bd808: Shutdown tools-exec-142* ([[phab:T217152|T217152]])
* 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
* 19:33 bd808: Shutdown tools-exec-141* ([[phab:T217152|T217152]])
* 19:31 bd808: Shutdown tools-bastion-0{2,3} ([[phab:T217152|T217152]])
* 19:19 bd808: Shutdown tools-exec-140* ([[phab:T217152|T217152]])
* 19:12 bd808: Shutdown tools-webgrid-generic-14* ([[phab:T217152|T217152]])
* 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-master ([[phab:T217152|T217152]])
* 18:53 bd808: Shutdown tools-grid-shadow ([[phab:T217152|T217152]])
* 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
* 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
* 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
* 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 15:27 bd808: Copied all crontab files still on tools-cron-01 to tool's $HOME/crontab.trusty.save
* 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} ([[phab:T217152|T217152]])
* 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} ([[phab:T217152|T217152]])
 
=== 2019-03-22 ===
* 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
* 16:12 bstorm_: cleared errored out stretch grid queues
* 15:56 bd808: Rebooting tools-static-12
* 03:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted 15 other nodes.  Entire stretch grid is in a good state for now.
* 02:31 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
* 02:09 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0924
* 00:39 bstorm_: [[phab:T217280|T217280]] depooled and rebooted tools-sgewebgrid-lighttpd-0902
 
=== 2019-03-21 ===
* 23:28 bstorm_: [[phab:T217280|T217280]] depooled, reloaded and repooled tools-sgeexec-0938
* 21:53 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
* 21:51 bstorm_: [[phab:T217280|T217280]] rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
* 21:26 bstorm_: [[phab:T217280|T217280]] cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related
 
=== 2019-03-18 ===
* 18:43 bd808: Rebooting tools-static-12
* 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01{{!}}07{{!}}10)` all else working
* 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
* 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
* 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.com is down
 
=== 2019-03-17 ===
* 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for [[phab:T218494|T218494]]
* 22:30 bd808: Investigating strange system state on tools-bastion-03.
* 17:48 bstorm_: [[phab:T218514|T218514]] rebooting tools-worker-1009 and 1012
* 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for [[phab:T218514|T218514]]
* 17:13 bstorm_: depooled and rebooting tools-worker-1018
* 15:09 andrewbogott: running 'killall dpkg and dpkg --configure -a' on all nodes to try to work around a race with initramfs
 
=== 2019-03-16 ===
* 22:34 bstorm_: clearing errored out queues again
 
=== 2019-03-15 ===
* 21:08 bstorm_: cleared error state on several queues [[phab:T217280|T217280]]
* 15:58 gtirloni: rebooted tools-clushmaster-02
* 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - [[phab:T130532|T130532]]
* 14:32 mutante: tools-sgebastion-07 - generating locales for user request in [[phab:T130532|T130532]]
 
=== 2019-03-14 ===
* 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} ([[phab:T217152|T217152]])
* 23:28 bd808: Deleted tools-bastion-05 ([[phab:T217152|T217152]])
* 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
* 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} ([[phab:T217152|T217152]])
* 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon ([[phab:T217152|T217152]])
* 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon ([[phab:T217152|T217152]])
* 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 ([[phab:T218341|T218341]])
* 21:32 gtirloni: rebooted tools-exec-1020 ([[phab:T218341|T218341]])
* 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 ([[phab:T218341|T218341]])
* 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled ([[phab:T217152|T217152]])
* 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
* 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
* 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
* 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
* 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
* 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
* 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
* 20:36 bd808: depooled and rebooted tools-sgeexec-0908
* 19:08 gtirloni: rebooted tools-worker-1028 ([[phab:T218341|T218341]])
* 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 ([[phab:T218341|T218341]])
* 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
* 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)
 
=== 2019-03-13 ===
* 23:30 bd808: Rebuilding stretch Kubernetes images
* 22:55 bd808: Rebuilding jessie Kubernetes images
* 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
* 17:10 bstorm_: rebooted cron server
* 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
* 12:33 arturo: reboot tools-sgebastion-08 ([[phab:T215154|T215154]])
* 12:17 arturo: reboot tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:53 arturo: enable puppet in tools-sgebastion-07 ([[phab:T215154|T215154]])
* 11:20 arturo: disable puppet in tools-sgebastion-07 for testing [[phab:T215154|T215154]]
* 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
* 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
* 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 ([[phab:T217406|T217406]])
 
=== 2019-03-11 ===
* 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot ([[phab:T218038|T218038]])
* 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI ([[phab:T218038|T218038]])
* 15:42 bd808: Rebooting tools-sgegrid-master ([[phab:T218038|T218038]])
* 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
* 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
 
=== 2019-03-10 ===
* 22:36 gtirloni: increased nscd group TTL from 60 to 300sec
 
=== 2019-03-08 ===
* 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
* 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization ([[phab:T217280|T217280]])
* 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)
 
=== 2019-03-07 ===
* 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
* 04:15 bd808: Killed 3 orphan processes on Trusty grid
* 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups ([[phab:T217280|T217280]])
* 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch [[phab:T217406|T217406]]
* 00:38 zhuyifei1999_: published misctools 1.37 [[phab:T217406|T217406]]
* 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild [[phab:T217406|T217406]]
 
=== 2019-03-06 ===
* 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02
 
=== 2019-03-04 ===
* 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for [[phab:T217473|T217473]]
* 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)
 
=== 2019-03-03 ===
* 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412
 
=== 2019-02-28 ===
* 19:36 zhuyifei1999_: built with debuild instead [[phab:T217297|T217297]]
* 19:08 zhuyifei1999_: test failures during build, see ticket
* 18:55 zhuyifei1999_: start building jobutils 1.36 [[phab:T217297|T217297]]
 
=== 2019-02-27 ===
* 20:41 andrewbogott: restarting nginx on tools-checker-01
* 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
* 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test [[phab:T176027|T176027]]
* 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
* 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon ([[phab:T217152|T217152]])
* 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
* 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs ([[phab:T217152|T217152]])
 
=== 2019-02-26 ===
* 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
* 19:01 gtirloni: pushed updated docker images
* 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test
 
=== 2019-02-25 ===
* 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for [[phab:T217066|T217066]]
* 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test [[phab:T217066|T217066]]
* 13:11 chicocvenancio: PAWS:  Stopped AABot notebook pod [[phab:T217010|T217010]]
* 12:54 chicocvenancio: PAWS:  Restarted Criscod notebook pod [[phab:T217010|T217010]]
* 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod [[phab:T217010|T217010]]
* 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} ([[phab:T216988|T216988]])
* 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
* 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
* 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
* 07:48 zhuyifei1999_: systemd stuck in D state. :(
* 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
* 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
* 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.
 
=== 2019-02-22 ===
* 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
* 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
* 15:13 gtirloni: shutdown tools-puppetmaster-01
 
=== 2019-02-21 ===
* 09:59 gtirloni: upgraded all packages in all stretch nodes
* 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
* 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up
 
=== 2019-02-20 ===
* 23:30 zhuyifei1999_: begin rebuilding all docker images [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
* 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
* 23:17 zhuyifei1999_: begin build new tools-webservice package [[phab:T178601|T178601]] [[phab:T193646|T193646]] [[phab:T215683|T215683]]
* 21:57 andrewbogott: moving tools-static-13  to a new virt host
* 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
* 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
* 16:56 andrewbogott: moving tools-paws-worker-1003
* 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
* 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442
 
=== 2019-02-19 ===
* 01:49 bd808: Revoked Toolforge project membership for user DannyS712 ([[phab:T215092|T215092]])
 
=== 2019-02-18 ===
* 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
* 20:22 gtirloni: enabled toolsdb monitoring in Icinga
* 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
* 18:50 chicocvenancio: moving paws back to toolsdb [[phab:T216208|T216208]]
* 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness
 
=== 2019-02-17 ===
* 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
* 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
* 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever
 
=== 2019-02-16 ===
* 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
* 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
* 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
* 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
* 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
* 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
* 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
* 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
* 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
* 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct stuffs
* 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
* 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
* 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
* 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
* 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP
 
=== 2019-02-14 ===
* 21:57 bd808: Deleted old tools-proxy-02 instance
* 21:57 bd808: Deleted old tools-proxy-01 instance
* 21:56 bd808: Deleted old tools-package-builder-01 instance
* 20:57 andrewbogott: rebooting tools-worker-1005
* 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
* 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
* 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
* 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
* 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
* 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
* 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
* 17:35 arturo: [[phab:T215154|T215154]] tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
* 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r
 
=== 2019-02-13 ===
* 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
* 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml{{!}}awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 [[phab:T216042|T216042]]
* 13:03 arturo: [[phab:T216030|T216030]] switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
 
=== 2019-02-12 ===
* 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers ([[phab:T215704|T215704]])
 
=== 2019-02-11 ===
* 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
* 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
* 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
* 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
* 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
* 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
* 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
* 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
* 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 ([[phab:T107878|T107878]])
* 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 ([[phab:T107878|T107878]])
* 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 ([[phab:T107878|T107878]])
* 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos ([[phab:T107878|T107878]])
* 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1
 
=== 2019-02-08 ===
* 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools` which is also unavailable
* 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for [[phab:T210829|T210829]].
* 13:49 gtirloni: upgraded all packages in SGE cluster
* 12:25 arturo: install aptitude in tools-sgebastion-06
* 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - [[phab:T215272|T215272]]
* 01:07 bd808: Creating tools-sgebastion-07
 
=== 2019-02-07 ===
* 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
* 20:18 gtirloni: cleared mail queue on tools-mail-02
* 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - [[phab:T215272|T215272]]
 
=== 2019-02-04 ===
* 13:20 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06
* 12:26 arturo: [[phab:T215154|T215154]] another reboot for tools-sgebastion-06. Puppet is disabled
* 11:38 arturo: [[phab:T215154|T215154]] reboot tools-sgebastion-06 to totally refresh systemd status
* 11:36 arturo: [[phab:T215154|T215154]] manually install systemd 239 in tools-sgebastion-06
 
=== 2019-01-30 ===
* 23:54 gtirloni: cleared apt cache on sge* hosts
 
=== 2019-01-25 ===
* 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch ([[phab:T214668|T214668]])
* 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for [[phab:T214447|T214447]]
* 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for [[phab:T214447|T214447]]
 
=== 2019-01-24 ===
* 11:09 arturo: [[phab:T213421|T213421]] delete tools-services-01/02
* 09:46 arturo: [[phab:T213418|T213418]] delete tools-docker-registry-02
* 09:45 arturo: [[phab:T213418|T213418]] delete tools-docker-builder-05 and tools-docker-registry-01
* 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01
 
=== 2019-01-23 ===
* 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image ([[phab:T214519|T214519]])
* 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image ([[phab:T214519|T214519]])
* 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance ([[phab:T214519|T214519]])
* 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon ([[phab:T214519|T214519]])
* 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
* 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 ([[phab:T211684|T211684]])
 
=== 2019-01-22 ===
* 20:21 gtirloni: published new docker images (all)
* 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs
 
=== 2019-01-21 ===
* 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet
 
=== 2019-01-18 ===
* 21:22 bd808: Forcing php-igbinary update via clush for [[phab:T213666|T213666]]
 
=== 2019-01-17 ===
* 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
* 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
* 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
* 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
* 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
* 17:16 arturo: [[phab:T213421|T213421]] shutdown tools-services-01/02. Will delete VMs after a grace period
* 12:54 arturo: add webservice security group to tools-sge-services-03/04
 
=== 2019-01-16 ===
* 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
* 16:38 arturo: [[phab:T213418|T213418]] shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
* 14:34 arturo: [[phab:T213418|T213418]] point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
* 14:24 arturo: [[phab:T213418|T213418]] allocate floating IPs for tools-docker-registry-03 & 04
 
=== 2019-01-15 ===
* 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
* 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
* 18:29 bstorm_: [[phab:T213711|T213711]] installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
* 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
* 14:21 arturo: [[phab:T213418|T213418]] put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`
 
=== 2019-01-14 ===
* 22:03 bstorm_: [[phab:T213711|T213711]] Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
* 22:03 bstorm_: [[phab:T213711|T213711]] Added ports needed for etcd-flannel to work on the etcd security group in eqiad
* 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
* 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
* 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
* 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
* 16:44 arturo: [[phab:T213418|T213418]] docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
* 14:00 arturo: [[phab:T213421|T213421]] disable updatetools in the new services nodes while building them
* 13:53 arturo: [[phab:T213421|T213421]] delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
* 13:47 arturo: [[phab:T213421|T213421]] create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`
 
=== 2019-01-11 ===
* 11:55 arturo: [[phab:T213418|T213418]] shutdown tools-docker-builder-05, will give a grace period before deleting the VM
* 10:51 arturo: [[phab:T213418|T213418]] created tools-docker-builder-06 in eqiad1
* 10:46 arturo: [[phab:T213418|T213418]] migrating tools-docker-registry-02 from eqiad to eqiad1
 
=== 2019-01-10 ===
* 22:45 bstorm_: [[phab:T213357|T213357]] - Added 24 lighttpd nodes to the new grid
* 18:54 bstorm_: [[phab:T213355|T213355]] built and configured two more generic web nodes for the new grid
* 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
* 00:12 bstorm_: [[phab:T213353|T213353]] Added 36 exec nodes to the new grid
 
=== 2019-01-09 ===
* 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
* 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
* 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
* 09:59 gtirloni: rebooted tools-checker-01 ([[phab:T213252|T213252]])
 
=== 2019-01-07 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
 
=== 2019-01-06 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
 
=== 2019-01-05 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
 
=== 2019-01-04 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
 
=== 2019-01-03 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01
 
=== 2018-12-21 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]
 
=== 2018-12-20 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002
 
=== 2018-12-17 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`
 
=== 2018-12-11 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
 
=== 2018-12-05 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
 
=== 2018-12-04 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])
 
=== 2018-12-01 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
 
=== 2018-11-30 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])
 
=== 2018-11-27 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
 
=== 2018-11-26 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])
 
=== 2018-11-20 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 09:43 godog: restart prometheus@tools on prometheus-01
 
=== 2018-11-16 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 17:47 gtirloni: deleted tools-mail instance
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades
 
=== 2018-11-14 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009
 
=== 2018-11-13 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit
 
=== 2018-11-08 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])
 
=== 2018-11-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
 
=== 2018-11-02 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
 
=== 2018-10-31 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
 
=== 2018-10-29 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
 
=== 2018-10-26 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
 
=== 2018-10-19 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
 
=== 2018-10-18 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
 
=== 2018-10-16 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
 
=== 2018-10-07 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens
 
=== 2018-09-21 ===
* 12:35 arturo: cleaned up stale apt preference files (pinning) in tools-clushmaster-01
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines
 
=== 2018-09-17 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
 
=== 2018-09-14 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API
 
=== 2018-09-08 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
 
=== 2018-09-07 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
 
=== 2018-08-27 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
 
=== 2018-08-22 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
 
=== 2018-08-19 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
 
=== 2018-08-14 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2
 
=== 2018-08-13 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice
 
=== 2018-08-09 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates
 
=== 2018-08-08 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
 
=== 2018-08-06 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
 
=== 2018-08-01 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
 
=== 2018-07-30 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
 
=== 2018-07-27 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
 
=== 2018-07-25 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
 
=== 2018-07-18 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]
 
=== 2018-06-30 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
 
=== 2018-06-29 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]
 
=== 2018-06-28 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck
 
=== 2018-06-21 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-20 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
 
=== 2018-06-14 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
 
=== 2018-06-11 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
 
=== 2018-06-08 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
 
=== 2018-06-07 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
 
=== 2018-06-06 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]])
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
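The scripted CrashLoopBackOff restarts above can be sketched roughly as follows. The pod listing is simulated sample data and the namespace/tool names are hypothetical, not output from the actual incident:

```shell
# Simulated `kubectl get pods --all-namespaces` output; names are made up.
pods='NAMESPACE   NAME        READY  STATUS            RESTARTS
toolA       toolA-123   0/1    CrashLoopBackOff  57
toolB       toolB-456   1/1    Running           0
toolC       toolC-789   0/1    CrashLoopBackOff  12'

# Select namespaces with crash-looping pods; in production each selected
# tool would then have its webservice restarted (e.g. `webservice restart`
# run as that tool).
echo "$pods" | awk '$4 == "CrashLoopBackOff" {print $1}' | sort -u
```

Iterating over that list with a per-tool restart is what "scripting a restart of webservice" amounts to.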
 
=== 2018-06-05 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by dubenben ([[phab:T196486|T196486]])
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])
 
=== 2018-06-04 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
 
=== 2018-06-03 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and all tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' [[phab:T195834|T195834]]
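Selecting jobs to qdel by name prefix amounts to filtering `qstat` output; a minimal sketch on simulated listing data (job IDs and names below are made up, not the real jobs deleted here):

```shell
# Simulated `qstat -u tools.mbh` output; IDs and names are illustrative.
qstat_out='job-ID  prior  name        user       state
1111    0.25   comm_delin1 tools.mbh  r
2222    0.25   lighttpd-x  tools.mbh  r
3333    0.25   delfilexcl2 tools.mbh  qw'

# Job IDs whose name starts with either prefix; each would be passed to qdel.
echo "$qstat_out" | awk 'NR>1 && ($3 ~ /^comm_delin/ || $3 ~ /^delfilexcl/) {print $1}'
```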
 
=== 2018-05-31 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
 
=== 2018-05-30 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]
 
=== 2018-05-28 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
 
=== 2018-05-25 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
 
=== 2018-05-22 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
 
=== 2018-05-18 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
 
=== 2018-05-16 ===
* 21:17 zhuyifei1999_: maintain-kubeusers stuck in infinite sleeps of 10 seconds
 
=== 2018-05-15 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
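Force-deleting a job stuck on an unresponsive host first means finding the job IDs bound to that host; a sketch on simulated `qstat` output (the second job ID and queue name are made up for illustration):

```shell
# Simulated running-job listing, e.g. from `qstat -s r` with queue instances.
running='job-ID  prior  name  user          state  queue
5221417 0.25   sga   tools.giftbot r      task@tools-exec-1414
5221500 0.25   job2  tools.other   r      task@tools-exec-1415'

# Split on spaces and '@' so the last field is the host; jobs bound to the
# unresponsive node would each get `qdel -f <job-ID>`.
echo "$running" | awk -F'[ @]+' '$NF == "tools-exec-1414" {print $1}'
```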
 
=== 2018-05-12 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
 
=== 2018-05-11 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2018-05-10 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
 
=== 2018-05-09 ===
* 21:11 Reedy: Added Tim Starling as member/admin
 
=== 2018-05-07 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
 
=== 2018-05-05 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
 
=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]
 
=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Killed orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`
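The orphan-hunting pipeline above boils down to selecting php-cgi processes reparented to init (ppid 1); a sketch on simulated `ps` output (users and PIDs are made up):

```shell
# Simulated `ps axwo user:20,ppid,pid,cmd` output; PIDs are illustrative.
ps_out='tools.jembot             1  4242 /usr/bin/php-cgi
tools.foo             1234  4243 /usr/bin/php-cgi
tools.bar                1  4244 /usr/bin/python3 bot.py'

# Orphans (ppid 1) running php-cgi; these PIDs would be fed to `kill -9`.
echo "$ps_out" | awk '$2 == 1 && $4 ~ /php-cgi/ {print $3}'
```

Using an explicit field match instead of `grep -E "    1 "` avoids accidentally matching a PID that merely contains a 1.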
 
=== 2018-04-15 ===
* 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]
 
=== 2018-04-11 ===
* 13:25 chasemp: cleaned up frozen exim messages in an effort to relieve queue pressure
 
=== 2018-04-06 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for [[phab:T159254|T159254]]
 
=== 2018-04-05 ===
* 18:46 chicocvenancio: killed wget that was hogging io
 
=== 2018-03-29 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
 
=== 2018-03-28 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
 
=== 2018-03-26 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
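The clush one-liner above deletes /tmp files not accessed for more than a day; the same `find` predicate can be demonstrated safely on a scratch directory (this is a sketch with a backdated file, not the production command):

```shell
# Demonstrate the cleanup predicate on a throwaway directory instead of /tmp.
dir=$(mktemp -d)
touch "$dir/fresh"
touch -a -d '3 days ago' "$dir/stale"   # backdate the access time (GNU touch)

# Regular files not accessed for more than a day are deleted, as in the
# fleet-wide clush run.
find "$dir" -type f -atime +1 -delete
remaining=$(ls "$dir")
echo "$remaining"
rm -rf "$dir"
```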
 
=== 2018-03-23 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
 
=== 2018-03-22 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-21 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])
 
=== 2018-03-20 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
 
=== 2018-03-19 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
 
=== 2018-03-16 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
 
=== 2018-03-15 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
 
=== 2018-03-14 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
 
=== 2018-03-12 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages from jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers
 
=== 2018-03-08 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
 
=== 2018-03-07 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
 
=== 2018-03-06 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] on some servers there was a race on the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts stalled dpkg operations. Already solved, but some puppet alerts were produced
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did in canary servers last week and it went fine. So run in fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
 
=== 2018-03-05 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]
 
=== 2018-03-02 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
 
=== 2018-03-01 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
 
=== 2018-02-27 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 12:16 arturo: running `apt-get autoremove` in canary servers
 
=== 2018-02-26 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 10:35 arturo: enable puppet in tools-proxy-01
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
 
=== 2018-02-25 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
 
=== 2018-02-23 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
 
=== 2018-02-22 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
 
=== 2018-02-21 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lightttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...
 
=== 2018-02-20 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:42 arturo: upgrading tools-k8s-etcd-01
 
=== 2018-02-19 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
 
=== 2018-02-16 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
 
=== 2018-02-15 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
 
=== 2018-02-14 ===
* 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
 
=== 2018-02-11 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
 
=== 2018-02-09 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that were running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
 
=== 2018-02-08 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
 
=== 2018-02-06 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 13:05 arturo: unpublish/publish trusty-tools repo
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to trusty-tools repo (self contained)
 
=== 2018-02-05 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
 
=== 2018-02-03 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
 
=== 2018-01-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
 
=== 2018-01-29 ===
* 20:02 chasemp: add zhuyifei1999_ tools root for  [[phab:T185577|T185577]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent
 
=== 2018-01-28 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03
 
=== 2018-01-27 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
 
=== 2018-01-25 ===
* 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]
 
=== 2018-01-23 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: added me, arturo, and chico to sudoers and removed marc
 
=== 2018-01-22 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster
 
=== 2018-01-19 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
 
=== 2018-01-18 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-17 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling tools-exec-1430 via exec-manage.
* 15:04 andrewbogott: depooling tools-exec-1430 via exec-manage. Experimenting with purge-old-kernels
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
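purge-old-kernels keeps the running kernel and removes older images; its selection step amounts to filtering the installed list, as sketched below (package versions are illustrative, chosen to match the kernels mentioned on 2018-03-06):

```shell
# Hypothetical running kernel and installed kernel images on a Trusty node.
running_kernel='3.13.0-142-generic'
installed='linux-image-3.13.0-137-generic
linux-image-3.13.0-139-generic
linux-image-3.13.0-142-generic'

# Everything except the running kernel's image is a purge candidate.
old=$(echo "$installed" | grep -v "$running_kernel")
echo "$old"
```

A real run would also keep the newest installed kernel even if not yet booted; this sketch only shows the basic filter.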
 
=== 2018-01-16 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail  almouked@ltnet.net 719 pending messages cleared
 
=== 2018-01-11 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install Meltdown-patched kernel and reboot workers 1011-1016 as jessie pilot
 
=== 2018-01-10 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
 
=== 2018-01-09 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
 
=== 2018-01-08 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
 
=== 2018-01-06 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
 
=== 2018-01-05 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
 
=== 2018-01-04 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
 
=== 2018-01-03 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)
</noinclude>
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>

