Nova Resource:Tools/SAL
=== 2022-09-28 ===
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)

=== 2017-06-30 ===
* 01:29 andrewbogott: rebooting tools-cron-01
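The 2022-09-28 entries record clearing an apt package conflict so the Puppet agent could converge again. A rough sketch of that sequence (host and package names are from the log; run-puppet-agent is the wrapper script named in the entry):
<syntaxhighlight lang="bash">
# Remove the conflicting packages, then re-run the Puppet agent.
sudo apt remove emacs-common emacs-bin-common
sudo run-puppet-agent
</syntaxhighlight>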


=== 2022-09-22 ===
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]

=== 2017-06-29 ===
* 23:01 madhuvishy: Uncordoned all k8s-workers
* 20:50 madhuvishy: depooling, rebooting and repooling all grid exec nodes
* 20:36 andrewbogott: depooling, rebooting, and repooling every lighttpd node three at a time
* 19:55 madhuvishy: Killed liangent-php jobs and usrd-tools jobs
* 18:00 madhuvishy: drain cordon reboot uncordon tools-worker-1015
* 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
* 17:22 bd808: rebooting tools-static-11
* 17:20 andrewbogott: rebooting tools-static-10
* 17:20 madhuvishy: drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
* 17:13 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
* 16:27 chasemp: restart k8s components on master (madhu)
* 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
* 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
* 15:57 chasemp: reboot tools-docker-registry-01 for nfs
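The repeated "drain cordon reboot uncordon" entries above follow the standard Kubernetes node-maintenance cycle, one worker at a time. A minimal sketch, using a node name from the log and assuming kubectl access from an admin host:
<syntaxhighlight lang="bash">
NODE=tools-worker-1015.tools.eqiad.wmflabs
kubectl cordon "$NODE"                             # stop new pods from landing on the node
kubectl drain "$NODE" --ignore-daemonsets --force  # evict the pods it is currently running
ssh "$NODE" sudo reboot                            # reboot the instance itself
kubectl uncordon "$NODE"                           # let the scheduler use it again
</syntaxhighlight>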


=== 2022-09-10 ===
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko

=== 2017-06-27 ===
* 21:32 andrewbogott: moving all tools nodes to new puppetmaster, tools-puppetmaster-01.tools.eqiad.wmflabs


=== 2022-09-07 ===
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])

=== 2017-06-25 ===
* 15:13 madhuvishy: Restarted webservice on tools.fatameh
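Restarting a tool's web service, as in the 2017-06-25 entry, is normally done from a bastion through the tool account. A sketch, with the tool name taken from the log:
<syntaxhighlight lang="bash">
become fatameh       # switch to the tool account
webservice restart   # stop and relaunch its web service
</syntaxhighlight>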


=== 2022-09-06 ===
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])

=== 2017-06-24 ===
* 16:01 bd808: Created and provisioned elasticsearch password for tools.wmde-uca-test ([[phab:T167971|T167971]])


=== 2022-08-25 ===
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]

=== 2017-06-23 ===
* 20:20 bd808: Reindexing various elasticsearch indexes created before we upgraded to v2.x
* 20:19 bd808: Dropped garbage indexes in elasticsearch cluster


=== 2022-08-24 ===
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 12:20 taavi: upgrading ingress-nginx to v1.3

=== 2017-06-22 ===
* 17:03 bd808: Rolled back attempt at Elasticsearch upgrade. Indices need to be rebuilt with 2.x before 5.x can be installed. [[phab:T164842|T164842]]
* 16:19 bd808: Backed up elasticsearch indexes to personal laptop using elasticdump in case [[phab:T164842|T164842]] goes horribly wrong
* 00:12 bd808: Set ownership and permissions on $HOME/.kube for all tools ([[phab:T165875|T165875]])
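An elasticdump backup like the one in the 2017-06-22 entry looks roughly like this; the index name, host, and output path here are placeholders rather than details from the log:
<syntaxhighlight lang="bash">
# Dump one index's documents to a local JSON file before a risky upgrade.
elasticdump \
  --input=http://tools-elastic-01:9200/example-index \
  --output=example-index-backup.json \
  --type=data
</syntaxhighlight>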


=== 2022-08-20 ===
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])

=== 2017-06-21 ===
* 17:43 andrewbogott: repooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 17:42 madhuvishy: Restarted webservice for openstack-browser
* 17:36 andrewbogott: depooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 17:35 andrewbogott: repooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
* 17:24 andrewbogott: depooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
* 17:23 andrewbogott: repooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
* 17:11 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
* 17:10 andrewbogott: repooling tools-webgrid-lighttpd-1412, tools-exec-1423
* 16:57 andrewbogott: depooling tools-webgrid-lighttpd-1412, tools-exec-1423
* 16:53 andrewbogott: repooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
* 16:52 andrewbogott: repooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 16:35 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 16:29 andrewbogott: depooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
* 16:05 godog: delete pods for lolrrit-wm to force restart
* 15:45 andrewbogott: repooling tools-exec-1422, tools-webgrid-lighttpd-1413
* 15:41 andrewbogott: switching the proxy ip back to tools-proxy-02
* 15:31 andrewbogott: temporarily pointing the tools-proxy IP to tools-proxy-01
* 15:26 andrewbogott: depooling tools-exec-1422, tools-webgrid-lighttpd-1413
* 15:12 andrewbogott: depooling tools-exec-1404, tools-exec-1434, tools-worker-1026
* 15:10 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 14:53 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 14:52 andrewbogott: repooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
* 14:37 andrewbogott: depooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
* 14:32 andrewbogott: repooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
* 14:20 andrewbogott: depooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
* 14:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
* 13:56 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
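Depooling and repooling a grid engine exec node, as in the long run of entries above, amounts to disabling its queues on the grid master, doing the maintenance, and re-enabling them. A sketch using one of the logged host names:
<syntaxhighlight lang="bash">
sudo qmod -d '*@tools-exec-1412'   # disable every queue instance on the node (depool)
# ...reboot or otherwise maintain the node...
sudo qmod -e '*@tools-exec-1412'   # re-enable the queues (repool)
</syntaxhighlight>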


=== 2022-08-18 ===
* 14:45 andrewbogott: adding lucaswerkmeister as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

=== 2017-06-14 ===
* 22:09 bd808: Restarted apache2 proc on tools-puppetmaster-02


=== 2022-08-17 ===
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

=== 2017-06-08 ===
* 18:14 madhuvishy: Also delete from /tmp on tools-webgrid-lighttpd-1411 xvfb-run.*, calibre_* and ws-*.epub
* 18:10 madhuvishy: Delete ws-*.epub from /tmp on tools-webgrid-lighttpd-1426
* 18:07 madhuvishy: Clean up space on /tmp on tools-webgrid-lighttpd-1426 by deleting temp files xvfb-run.* and calibre_1.25.0_tmp_* created by the wsexport tool
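The 2017-06-08 entries above clean wsexport scratch files out of a webgrid node's /tmp. A minimal sketch using the file patterns from the log:
<syntaxhighlight lang="bash">
# Delete the temp files named in the entries above to recover /tmp space.
sudo rm -rf /tmp/xvfb-run.* /tmp/calibre_* /tmp/ws-*.epub
</syntaxhighlight>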


=== 2022-08-16 ===
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05

=== 2017-06-07 ===
* 19:05 madhuvishy: Killed scp job run by user torin8 on tools-bastion-02


=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

=== 2017-06-06 ===
* 20:30 chasemp: rebooting tools-bastion-02 as unresponsive (up 76 days and lots of seemingly left behind things running)


=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2017-06-05 ===
* 23:44 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1412 on 2017-05-24T20:55Z
* 22:15 bd808: Deleted tools.aibot crontab that somehow got locally installed on tools-exec-1436 on 2017-05-24T20:55Z
* 19:55 andrewbogott: disabling puppet on tools-proxy-01 and -02 for a staged rollout of https://gerrit.wikimedia.org/r/#/c/350494/16


=== 2022-08-03 ===
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station

=== 2017-06-01 ===
* 15:15 andrewbogott: depooling/rebooting/repooling tools-exec-1403 as part of old kernel-purge testing


=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2017-05-31 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice v0.37 ([[phab:T163355|T163355]])
* 19:24 bd808: Updating toollabs-webservice package via clush ([[phab:T163355|T163355]])
* 19:16 bd808: Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 ([[phab:T163355|T163355]])
* 16:34 andrewbogott: running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
* 16:25 andrewbogott: rebooting tools-exec-1404 as part of a disk-space-saving test
* 14:07 andrewbogott: migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 ([[phab:T165753|T165753]])


=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest

=== 2017-05-30 ===
* 22:32 andrewbogott: migrating tools-webgrid-lighttpd-1406, tools-exec-1410 from labvirt1006 to labvirt1009 to balance cpu usage
* 18:15 andrewbogott: restarted robokobot virgule to free up leaked files
* 17:36 andrewbogott: restarting excel2wiki to clean up file leaks
* 17:36 andrewbogott: restarting idwiki-welcome in kenrick95bot to free up leaked files
* 17:31 andrewbogott: restarting onetools to clean up file leaks
* 17:29 andrewbogott: restarting ytcleaner webservice to clean up leaked files
* 17:22 andrewbogott: restarting vltools to clean up leaked files
* 17:20 madhuvishy: Uncordoned tools-worker-1006
* 17:16 madhuvishy: Killed tool videoconvert on tools-exec-1440 in debugging labstore disk space issues
* 17:15 madhuvishy: Drained and rebooted tools-worker-1006
* 17:15 andrewbogott: restarted croptool to clean up stray files
* 17:15 madhuvishy: depooled, rebooted, and repooled tools-exec-1412
* 17:15 andrewbogott: restarted catmon tool to clean up stray files


=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2017-05-26 ===
* 20:32 bd808: Added tools-webgrid-lighttpd-14{19,2[0-8]} as submit hosts
* 20:31 bd808: Added tools-webgrid-lighttpd-1412 and tools-webgrid-lighttpd-1413 as submit hosts
* 20:28 bd808: sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs
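Registering a machine as a grid engine submit host, as in the 20:28 qconf entry above, and then verifying the result:
<syntaxhighlight lang="bash">
sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs  # add the submit host
sudo qconf -ss                                                  # list submit hosts to verify
</syntaxhighlight>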


=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2

=== 2017-05-22 ===
* 07:49 chasemp: move ooooold shared resources into archive for later cleanup


=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2017-05-20 ===
* 09:27 madhuvishy: Truncating jerr.log for tool videoconvert since it's 967GB
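Truncating a runaway log file, as in the 2017-05-20 entry, keeps the open file handle valid for the still-running job, unlike deleting the file. A sketch; the exact path is an assumption based on the tool home directory convention:
<syntaxhighlight lang="bash">
sudo truncate -s 0 /data/project/videoconvert/jerr.log  # reclaim the space in place
</syntaxhighlight>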


=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon

=== 2017-05-10 ===
* 19:11 bd808: Edited striker db record for user Stepan Grigoryev to detach SUL and Phab accounts. [[phab:T164849|T164849]]
* 17:47 bd808: Signed and revoked puppet certs generated when our DNS flipped out and gave hosts non-FQDN hostnames
* 17:29 bd808: Fixed broken puppet cert on tools-package-builder-01


=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2017-05-04 ===
* 19:23 madhuvishy: Rebooting tools-grid-shadow
* 16:21 madhuvishy: Start instance tools-grid-master.tools from horizon
* 16:20 madhuvishy: Shut off tools-grid-master.tools instance from horizon
* 16:16 madhuvishy: Stopped gridengine-shadow on tools-grid-shadow.tools (service gridengine-shadow stop and kill -9 individual shadowd processes)


=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]

=== 2017-04-24 ===
* 15:33 bd808: Removed Gergő Tisza as a projectadmin for [[phab:T163611|T163611]]; event done


=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]

=== 2017-04-21 ===
* 22:30 bd808: Added Gergő Tisza as a projectadmin for [[phab:T163611|T163611]]
* 13:43 chasemp: [[phab:T161898|T161898]] clush -g all 'sudo puppet agent --disable "rollout nfs-mount-manager"'
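The 2022-06-27 entries in the section above snapshot the puppetmaster CA directory before replacing any certificates; the paths come from the log, while the exact tar invocation is an assumption:
<syntaxhighlight lang="bash">
sudo tar -czf /root/puppet-ca-backup-2022-06-27.tar.gz /var/lib/puppet/server
</syntaxhighlight>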


=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]

=== 2017-04-20 ===
* 17:15 bd808: Deleted shutdown VM tools-docker-builder-04; tools-docker-builder-05 is the new hotness
* 17:11 bd808: kill -INT 19897 on tools-proxy-02 to stop a hung nginx child process left from the last graceful restart of nginx


=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2017-04-19 ===
* 15:10 bd808: apt-get install psmisc on tools-proxy-0[12]
* 13:23 chasemp: stop docker on tools-proxy-01
* 13:20 chasemp: clean up disk space on tools-proxy-01


=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2017-04-18 ===
* 20:37 bd808: Restarted bigbrother on tools-services-02
* 04:23 bd808: Shutdown tools-docker-builder-04; will wait a bit before deleting
* 04:04 bd808: Built and pushed new Docker images based on {{Gerrit|82a46b4}} (Refactor apt-get actions in Dockerfiles)
* 03:42 bd808: Made tools-docker-builder-05.tools.eqiad.wmflabs the active docker build host
* 01:01 bd808: Built instance tools-package-builder-01


=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]

=== 2017-04-17 ===
* 20:41 bd808: Building tools-docker-builder-05
* 19:35 chasemp: add reedy to sudo all perms so he can admin things
* 17:21 andrewbogott: adding 8 more exec nodes: tools-exec-1435 through 1442


=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2017-04-11 ===
* 16:46 andrewbogott: added exec nodes tools-exec-1430, 31, 32, 33, 34.
* 14:15 andrewbogott: emptied /srv/pbuilder to make space on tools-docker-04
* 02:35 bd808: Restarted maintain-kubeusers on tools-k8s-master-01


=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

=== 2017-04-03 ===
* 13:48 chasemp: enable puppet on gridmaster


=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation

=== 2017-04-01 ===
* 15:28 andrewbogott: added five new exec nodes, tools-exec-1425 through 1429
* 14:26 chasemp: up nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
* 14:00 chasemp: disable puppet on tools-grid-master
* 13:52 chasemp: tools-grid-master tc-setup clean
* 13:40 chasemp: restart nscd and nslcd on tools-grid-master
* 13:31 chasemp: reboot tools-exec-1420


=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]

=== 2017-03-31 ===
* 22:25 yuvipanda: apt-get update && apt-get install kubernetes-node on tools-proxy-01 to upgrade kube-proxy systemd service unit


=== 2022-05-26 ===
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko

=== 2017-03-30 ===
* 20:29 chasemp: stop grid-master temporarily & umount -fl project nfs & remount & start grid-master
* 17:38 chasemp: reboot tools-exec-1401
* 17:30 madhuvishy: Updating tools project hiera config to add role::labs::nfsclient::lookupcache: all via Horizon ([[phab:T136712|T136712]])
* 17:29 madhuvishy: Disabled puppet across tools in prep for [[phab:T136712|T136712]]


=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko

=== 2017-03-27 ===
* 04:06 andrewbogott: erasing random log files on tools-proxy-01 to avoid filling the disk


=== 2022-05-16 ===
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko

=== 2017-03-23 ===
* 20:38 andrewbogott: migrating tools-exec-1401 to labvirt1001
* 19:56 andrewbogott: migrating tools-exec-1408 to labvirt1001
* 19:02 andrewbogott: migrating tools-exec-1407 to labvirt1001
* 16:37 andrewbogott: migrating tools-webgrid-lighttpd-1402 and 1407 to labvirt1001 (testing labvirt1001 and easing CPU load on labvirt1010)


=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

=== 2017-03-22 ===
* 13:48 andrewbogott: migrating tools-bastion-02 in 15 minutes


=== 2022-05-12 ===
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko

=== 2017-03-21 ===
* 17:06 andrewbogott: moving tools-webgrid-lighttpd-1404 to labvirt1012 to ease pressure on labvirt1004
* 16:19 andrewbogott: moving tools-exec-1406 to labvirt1011 to ease CPU usage on labvirt1004


=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]

=== 2017-03-20 ===
* 22:47 yuvipanda: disable puppet on all k8s workers to test https://gerrit.wikimedia.org/r/#/c/343708/
* 18:36 bd808: Applied openstack::clientlib on tools-checker-02 and forced puppet run
* 18:03 bd808: Applied openstack::clientlib on tools-checker-01 and forced puppet run
* 17:31 andrewbogott: migrating tools-exec-1417 to labvirt1013
* 17:05 andrewbogott: migrating tools-webgrid-lighttpd-1410 to labvirt1012 to reduce load on labvirt1001
* 16:42 andrewbogott: migrating tools-webgrid-generic-1404 to labvirt1011 to reduce load on labvirt1001
* 16:13 andrewbogott: migrating tools-exec-1408 to labvirt1010 to reduce load on labvirt1001


=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])

=== 2017-03-17 ===
* 17:24 andrewbogott: moving tools-webgrid-lighttpd-1416 to labvirt1013 to reduce load on labvirt1004
* 17:15 andrewbogott: moving tools-exec-1424 to labvirt1012 to ease load on labvirt1004


=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]

=== 2017-03-15 ===
* 19:21 andrewbogott: added new exec nodes: tools-exec-1421 and tools-exec-1422
* 17:42 madhuvishy: Restarted stashbot
* 17:29 chasemp: docker stop && rm -fR /var/lib/docker/* on worker-1001
* 17:20 chasemp: test of logging
* 16:11 chasemp: k8s master 'for h in `kubectl get nodes {{!}} grep worker {{!}} grep -v NotReady {{!}} grep -v Disabled {{!}} awk '{print $1}'`; do echo $h && kubectl drain --delete-local-data --force $h && sleep 10 ; done'
* 16:08 chasemp: stop puppet on k8s master and drain nodes
* 15:50 chasemp: (late) kill what appears to be an android emulator? unsure but it's eating all IO


=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])

=== 2017-03-14 ===
* 21:24 bd808: Deleted tools-precise-dev ([[phab:T160466|T160466]])
* 21:13 bd808: Removed non-existent tools-submit.eqiad.wmflabs from submit hosts list
* 21:02 bd808: Deleted tools-exec-gift ([[phab:T160461|T160461]])
* 20:45 bd808: Deleted tools-webgrid-lighttpd-12* nodes ([[phab:T160442|T160442]])
* 20:29 bd808: Deleted tools-exec-12* nodes ([[phab:T160457|T160457]])
* 20:27 bd808: Disassociated floating IPs from tools-exec-12* nodes ([[phab:T160457|T160457]])
* 17:41 madhuvishy: Hand fix tools-puppetmaster by removing the old mariadb submodule directory
* 17:23 madhuvishy: Remove role::toollabs::precise_reminder from tools-bastion-03
* 15:40 bd808: Installing toollabs-webservice 0.36 across cluster using clush
* 15:36 bd808: Upgraded toollabs-webservice to 0.36 on tools-bastion-02.tools
* 15:25 bd808: Installing jobutils 1.21 across cluster using clush
* 15:23 bd808: Installed jobutils 1.21 on tools-bastion-02
* 15:03 bd808: Shutting down webservices running on Precise job grid nodes
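The clush entries above roll a package out fleet-wide after testing it on one bastion. A sketch of that pattern; the "all" node group is the one referenced elsewhere in this log:
<syntaxhighlight lang="bash">
# Test on one host first, then fan out to the whole fleet with clush.
clush -g all 'sudo apt-get update && sudo apt-get install -y jobutils'
</syntaxhighlight>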


=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]

=== 2017-03-13 ===
* 21:12 valhallasw`cloud: tools-bastion-03: killed heavy unzip operation from staeiou, and heavy (inadvertent large file opening?) vim operation from steenth, as the entire server was blocked due to high i/o


=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82

=== 2017-03-07 ===
* 17:59 andrewbogott: depooling, migrating tools-exec-1416 as part of ongoing labvirt1001 issues
* 17:21 madhuvishy: tools-webgrid-lighttpd-1409 migrated to labvirt1011 and repooled
* 16:31 madhuvishy: Depooled tools-webgrid-lighttpd-1409 for cold migrating to different labvirt


=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])

=== 2017-03-06 ===
* 22:52 andrewbogott: migrating tools-webgrid-lighttpd-1411 to labvirt1011 to give labvirt1001 a break
* 19:03 madhuvishy: Stopping webservice running on tool tree-of-life on author request
* 18:25 yuvipanda: set complex_values slots=300,release=trusty for tools-exec-gift-trusty-01.tools.eqiad.wmflabs


=== 2022-04-20 ===
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko

=== 2017-03-04 ===
* 23:47 madhuvishy: Added new k8s workers 1028, 1029


=== 2022-04-16 ===
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko

=== 2017-02-28 ===
* 03:52 scfc_de: Deployed jobtools and misctools 1.20/1.20~precise+1 ([[phab:T158722|T158722]]).


=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])

=== 2017-02-27 ===
* 02:42 scfc_de: Purged misctools from instances where not puppetized.
* 02:42 scfc_de: Deployed jobtools and misctools 1.19/1.19~precise+1 ([[phab:T155787|T155787]], [[phab:T156886|T156886]]).


=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)

=== 2017-02-17 ===
* 12:51 chasemp: create tools-exec-gift-trusty-01
* 12:40 chasemp: create tools-exec-gift-trusty
* 12:24 chasemp: mass apt-get clean and removal of some old .gz log files due to 30+ low space warnings


=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

=== 2017-02-15 ===
* 18:45 yuvipanda: clush a restart of nscd across all of tools
* 00:01 bd808: Rebuilt python and python2 Docker images ([[phab:T157744|T157744]])


=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

=== 2017-02-08 ===
* 06:22 yuvipanda: drain tools-worker-1026 for docker upgrade
* 05:28 yuvipanda: drain pods from tools-worker-1027.tools.eqiad.wmflabs for docker upgrade
* 05:28 yuvipanda: disable puppet on all k8s nodes in preparation for docker upgrade


=== 2022-04-05 ===
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7

=== 2017-02-07 ===
* 13:49 scfc_de: Deployed toollabs-webservice_0.33_all.deb ([[phab:T156605|T156605]], [[phab:T156626|T156626]]).
* 13:49 scfc_de: Deployed tools-manifest_0.11_all.deb.


=== 2022-04-04 ===
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions

=== 2017-02-04 ===
* 02:13 yuvipanda: launch tools-worker-1027 to see if puppet works fine on first run!
* 02:13 yuvipanda: reboot tools-worker-1026 to see if it comes up fine
* 01:46 yuvipanda: launch tools-worker-1026


=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo

=== 2017-02-03 ===
* 21:34 madhuvishy: Migrated over precise tools to trusty for user multichill (catbot, family, locator, multichill, nlwikibots, railways, wlmtrafo, wikidata-janitor)
* 21:13 chasemp: reboot tools-bastion-03 as unresponsive


=== 2022-03-15 ===
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)

=== 2017-02-02 ===
* 20:39 yuvipanda: import docker-engine 1.11.2 (currently running version) and 1.12.6 (latest version) into aptly
* 00:06 madhuvishy: Remove user maximilianklein from tools.cite-o-meter (on request)
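Clearing a grid queue's error state, as in the 2022-03-15 entry above about continuous@tools-sgeexec-0939:
<syntaxhighlight lang="bash">
sudo qmod -cq 'continuous@tools-sgeexec-0939.tools.eqiad.wmflabs'  # clear the queue's error state
</syntaxhighlight>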


=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo ([[phab:T297090|T297090]])

=== 2017-01-30 ===
* 20:25 yuvipanda: sudo ln -s /usr/bin/kubectl /usr/local/bin/kubectl to temporarily fix webservice shell not working


=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

=== 2017-01-27 ===
* 19:22 chasemp: reboot tools-bastion-02 as it is having issues
* 02:01 madhuvishy: Reenabled puppet on tools-checker-01
* 00:29 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/


=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

=== 2017-01-26 ===
* 23:37 madhuvishy: reenabled puppet on tools-checker
* 23:02 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
* 16:08 chasemp: major cleanup for stale var items on tools-exec-1221


=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

=== 2017-01-24 ===
* 18:14 andrewbogott: one last reboot of tools-mail
* 18:00 andrewbogott: apt-get autoremove on tools-mail
* 17:51 andrewbogott: rebooting tools-mail post upgrade
* 17:19 andrewbogott: restarting tools-mail, beginning do-release-upgrade -d -q
* 17:17 andrewbogott: backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
* 17:15 andrewbogott: stopping tools-mail, backing up, upgrading from precise to trusty
* 15:49 yuvipanda: clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
* 14:42 yuvipanda: re-enable puppet on tools-proxy-01, test success on proxy-02
* 14:37 yuvipanda: disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
* 13:52 yuvipanda: upgrading k8s on worker nodes to use debs + new k8s version
* 13:52 yuvipanda: finished upgrading k8s + using debs
* 12:49 yuvipanda: purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages


=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

=== 2017-01-23 ===
* 19:36 andrewbogott: temporarily shutting down tools-webgrid-lighttpd-1201
* 19:35 yuvipanda: depool tools-webgrid-lighttpd-1201 for snapshotting tests
* 17:13 chasemp: reboot tools-exec-1411 as having serious transient issues


=== 2022-02-16 ===
* 00:50 bd808: sudo qdel -f {{Gerrit|1199218}} to force delete a stuck toolschecker job
* 00:12 bd808: Image builds completed.

=== 2017-01-20 ===
* 15:58 yuvipanda: enabling puppet across all hosts
* 15:36 yuvipanda: disable puppet everywhere to cherrypick patch moving base to a profile


=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]

=== 2017-01-17 ===
* 18:47 madhuvishy: Reenabled puppet across tools
* 18:26 madhuvishy: Disabling puppet across tools to test https://gerrit.wikimedia.org/r/#/c/329707/


=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]

=== 2017-01-11 ===
* 22:09 chasemp: add Reedy to admin in tool labs (approved by bryon and chase for access to investigate specific tool abuse behavior)


=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]

=== 2017-01-10 ===
* 19:05 madhuvishy: Killed 3 jobs from tools.arnaub that were causing high load on tools-exec-1411


=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]

=== 2017-01-06 ===
* 19:02 bd808: Terminated deprecated instances tools-exec-121[2-6] ([[phab:T154539|T154539]])


=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes

=== 2017-01-04 ===
* 02:43 madhuvishy: Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. [[phab:T152369|T152369]]


=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

=== 2017-01-03 ===
* 23:56 bd808: Removed tools-exec-12[12-16] from gridengine ([[phab:T154539|T154539]])
* 23:27 bd808: drained tools-exec-1216 ([[phab:T154539|T154539]])
* 23:26 bd808: drained tools-exec-1215 ([[phab:T154539|T154539]])
* 23:25 bd808: drained tools-exec-1214 ([[phab:T154539|T154539]])
* 23:25 bd808: drained tools-exec-1213 ([[phab:T154539|T154539]])
* 23:24 bd808: drained tools-exec-1212 ([[phab:T154539|T154539]])
* 23:11 madhuvishy: Disabled puppet on tools-checker-01 ([[phab:T152369|T152369]])
* 21:43 madhuvishy: Adding iptables rule to drop incoming connections from toolschecker on labservices1001
* 20:51 madhuvishy: Adding iptables rule to block outgoing connections to labservices1001 on tools-checker-01
* 20:43 madhuvishy: Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out [[phab:T152369|T152369]]
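The 2017-01-03 entries above simulate a failure by firewalling toolschecker traffic. An illustrative iptables rule of that shape; the IP address is a placeholder, not the real host address:
<syntaxhighlight lang="bash">
CHECKER_IP=203.0.113.10                          # placeholder for tools-checker-01's address
sudo iptables -A INPUT -s "$CHECKER_IP" -j DROP  # drop its incoming connections
# Remove the rule again after the test:
sudo iptables -D INPUT -s "$CHECKER_IP" -j DROP
</syntaxhighlight>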


=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]]
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]

=== 2016-12-25 ===
* 00:28 yuvipanda: comment out cron running 'clean' script of avicbot every minute without -once
* 00:28 yuvipanda: force delete all jobs of avicbot
* 00:25 yuvipanda: delete all jobs of avicbot. This is 419 jobs
* 00:20 yuvipanda: kill clean.sh process of avicbot
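Force-deleting one account's grid jobs in bulk, as in the avicbot cleanup above (419 jobs); the tool account name comes from the log:
<syntaxhighlight lang="bash">
qdel -f -u tools.avicbot   # -f force-deletes jobs even if the exec host is unresponsive
</syntaxhighlight>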


=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])

=== 2016-12-19 ===
* 20:07 valhallasw`cloud: killed gps_exif_bot2.py (tools.gpsexif), was using 50MB/s io, lagging all of tools-bastion-03
* 13:06 yuvipanda: run /usr/local/bin/deploy-master http://tools-docker-builder-03.tools.eqiad.wmflabs v1.3.3wmf1 on tools-k8s-master-01
* 12:53 yuvipanda: cleaned out pbuilder from tools-docker-builder-01 to clean up


=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

=== 2016-12-17 ===
* 04:49 yuvipanda: turned on lookupcache again for bastions


=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2016-12-15 ===
* 18:52 yuvipanda: reboot tools-exec-1204
* 18:49 yuvipanda: reboot tools-webgrid-lighttpd-12[01-05]
* 18:45 yuvipanda: reboot tools-exec-gift
* 18:41 yuvipanda: reboot tools-exec-1217 to 1221
* 18:30 yuvipanda: rebooted tools-exec-1212 to 1216
* 14:55 yuvipanda: reboot tools-services-01


=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])

=== 2016-12-14 ===
* 18:43 mutante: tools-bastion-03 - ran 'locale-gen ko_KR.EUC-KR' for [[phab:T130532|T130532]]


=== 2022-01-19 ===
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

=== 2016-12-13 ===
* 20:54 chasemp: reboot bastion-03 as unresponsive


=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]

=== 2016-12-09 ===
* 19:32 godog: upgrade / restart prometheus-node-exporter
* 08:37 YuviPanda: run delete-dbusers and force replica.my.cnf creation for all tools that did not have it


=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

=== 2016-12-08 ===
* 18:48 YuviPanda: restarted toolschecker on tools-checker-01


=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]

=== 2016-12-07 ===
* 09:45 YuviPanda: restart redis on tools-proxy-02
* 09:32 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/324210 and https://gerrit.wikimedia.org/r/324211
* 09:29 YuviPanda: clush -g k8s-worker -g k8s-master -g webproxy -b 'sudo puppet agent --disable "Deploying k8s change with alex"'


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)

=== 2016-12-06 ===
* 00:36 bd808: Updated toollabs-webservice to 0.31 on rest of cluster ([[phab:T147350|T147350]])

=== 2016-12-05 ===
* 23:19 bd808: Updated toollabs-webservice to 0.31 on tools-bastion-02 ([[phab:T147350|T147350]])
* 22:55 bd808: Updated jobutils to 1.17 on tools-mail ([[phab:T147350|T147350]])
* 22:53 bd808: Updated jobutils to 1.17 on tools-precise-dev ([[phab:T147350|T147350]])
* 22:53 bd808: Updated jobutils to 1.17 on tools-cron-01 ([[phab:T147350|T147350]])
* 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-03 ([[phab:T147350|T147350]])
* 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-02 ([[phab:T147350|T147350]])
* 16:53 bd808: Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" ([[phab:T151980|T151980]])
* 16:50 bd808: Released floating IPs from decommissioned tools-exec-12[01-11] instances
 
=== 2016-11-30 ===
* 23:06 bd808: Removed tools-exec-12[00-11] from gridengine ([[phab:T151980|T151980]])
* 22:54 bd808: Removed tools-exec-12[00-11] from @general hostgroup
* 15:17 chasemp: restart coibot 'coibot.sh -o syslog.output -e syslog.errors -r yes'
* 05:20 bd808: rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain ([[phab:T151980|T151980]])
* 05:18 bd808: drained tools-exec-1211 ([[phab:T151980|T151980]])
* 05:14 bd808: drained tools-exec-1209 ([[phab:T151980|T151980]])
* 05:13 bd808: drained tools-exec-1208 ([[phab:T151980|T151980]])
* 05:12 bd808: drained tools-exec-1207 ([[phab:T151980|T151980]])
* 05:10 bd808: drained tools-exec-1206 ([[phab:T151980|T151980]])
* 05:07 bd808: drained tools-exec-1205 ([[phab:T151980|T151980]])
* 05:04 bd808: drained tools-exec-1204 ([[phab:T151980|T151980]])
* 05:00 bd808: drained tools-exec-1203 ([[phab:T151980|T151980]])
* 05:00 bd808: drained tools-exec-1202 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1211 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1210 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1209 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1208 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1207 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1206 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1205 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1204 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1203 ([[phab:T151980|T151980]])
* 04:55 bd808: disabled queues on tools-exec-1202 ([[phab:T151980|T151980]])
* 04:52 bd808: drained tools-exec-1201 ([[phab:T151980|T151980]])
* 04:48 bd808: draining tools-exec-1201
 
=== 2016-11-29 ===
* 13:43 hashar: updating jouncebot so it properly reclaims its nick ([[phab:T150916|T150916]], https://gerrit.wikimedia.org/r/#/c/324025/)
 
=== 2016-11-22 ===
* {{SAL entry|1=15:13 chasemp: readd attr +i to replica.my.cnf that seems to have gotten lost in rsync migration}}
 
=== 2016-11-21 ===
* {{SAL entry|1=21:15 YuviPanda: disable puppet everywhere}}
* {{SAL entry|1=19:49 YuviPanda: restart all webservice jobs on gridengine to pick up logging again}}
 
=== 2016-11-20 ===
* {{SAL entry|1=06:51 Krenair: ran `qmod -rj lighttpd-admin` as tools.admin to try to get the main page back up, it worked briefly but then broke again}}
 
=== 2016-11-16 ===
* {{SAL entry|1=20:14 yuvipanda: upgrade toollabs-webservice to 0.30 on all webgrid nodes}}
* {{SAL entry|1=18:31 chasemp: reboot tools-exec-1404 (already depooled)}}
* {{SAL entry|1=18:19 chasemp: reboot tools-exec-1403}}
* {{SAL entry|1=17:23 chasemp: reboot tools-exec-1212 (converted via 321786 testing for recovery on boot)}}
* {{SAL entry|1=16:55 chasemp: clush -g all "puppet agent --disable 'trail run for changeset 321786 handling /var/lib/gridengine'"}}
* {{SAL entry|1=02:05 yuvipanda: rebooting tools-docker-registry-01, can't ssh in}}
* {{SAL entry|1=01:43 yuvipanda: cleanup old images on tools-docker-builder-03}}
 
=== 2016-11-15 ===
* {{SAL entry|1=19:52 chasemp: reboot tools-precise-dev}}
* {{SAL entry|1=05:20 yuvipanda: restart all k8s webservices too}}
* {{SAL entry|1=05:05 yuvipanda: restarting all webservices on gridengine}}
* {{SAL entry|1=03:21 chasemp: reboot tools-checker-01}}
* {{SAL entry|1=02:56 chasemp: reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies)}}
* {{SAL entry|1=02:31 chasemp: reboot tools-exec-1406}}
 
=== 2016-11-14 ===
* {{SAL entry|1=22:51 chasemp: shut down bastion 02 and 05 and make 03 root only}}
* {{SAL entry|1=19:35 madhuvishy: Stopped cron on tools-cron-01 (T146154)}}
* {{SAL entry|1=18:24 madhuvishy: Tools NFS is read-only. /data/project and /home across tools are ro T146154}}
* {{SAL entry|1=16:57 yuvipanda: stopped gridengine master}}
* {{SAL entry|1=16:47 yuvipanda: start restarting kubernetes webservice pods}}
* {{SAL entry|1=16:30 madhuvishy: Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154}}
* {{SAL entry|1=16:22 yuvipanda: kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS}}
* {{SAL entry|1=16:22 chasemp: enable puppet and run on tools-services-01}}
* {{SAL entry|1=16:21 yuvipanda: restarting all webservice jobs, watching webservicewatcher logs on tools-services-02}}
* {{SAL entry|1=16:14 madhuvishy: Disabling puppet across tools T146154}}
 
=== 2016-11-11 ===
* 20:49 madhuvishy: Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154
* 20:18 madhuvishy: Rolling out dual mount of tools share across all hosts T146154
* 19:29 madhuvishy: Disabling puppet across tools to dual mount tools share from labstore-secondary T146154
 
=== 2016-11-02 ===
* 18:23 yuvipanda: manually stop tools-grid-master for reboot
* 17:42 yuvipanda: drain nodes from labvirt1012 and 13
* 13:42 chasemp: depool tools-exec-1404 for maint
 
=== 2016-11-01 ===
* 21:54 yuvipanda: stop gridengine-master on tools-grid-master in preparation for reboot
* 21:34 yuvipanda: depool tools nodes on labvirt1012
* 21:16 yuvipanda: depool things in labvirt1011
* 20:58 yuvipanda: depool tools nodes on labvirt1010
* 20:32 yuvipanda: depool tools things on labvirt1005 and 1009
* 20:08 yuvipanda: depooled things on labvirt1006 and 1008
* 19:51 yuvipanda: move tools-elastic-03 to labvirt1010, -02 already in 09
* 19:34 yuvipanda: migrate tools-elastic-03 to labvirt1009
* 19:10 yuvipanda: depooled tools nodes from labvirt1004 and 1007
* 17:57 yuvipanda: depool exec nodes on labvirt1002
* 13:27 chasemp: reboot tools-exec-1404 post depool for test
 
=== 2016-10-31 ===
* 21:50 yuvipanda: deleted cyberbot queue with qconf -dq cyberbot
* 21:44 yuvipanda: restarted cron on tools-cron-01
 
=== 2016-10-30 ===
* 02:25 yuvipanda: restarted maintain-kubeusers
 
=== 2016-10-29 ===
* 17:21 yuvipanda: depool tools-worker-1005
 
=== 2016-10-28 ===
* 20:15 chasemp: restart prometheus service on tools-prometheus-01 to see if that wakes it up
* 20:06 yuvipanda: restart kube-apiserver again, ran into too many open file handles
* 15:58 Yuvi[m]: restart k8s master, seems to have run out of fds
* 15:43 chasemp: restart toolschecker service on 01 and 02
 
=== 2016-10-27 ===
* 21:09 godog: upgrade prometheus on tools-prometheus0[12]
* 18:49 andrewbogott: rebooting  tools-webgrid-lighttpd-1401
* 13:51 chasemp: reboot tools-webgrid-generic-1403
* 13:50 chasemp: reboot dockerbuilder-01
 
=== 2016-10-26 ===
* 23:20 madhuvishy: Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
* 23:17 godog: upgrade prometheus on tools-prometheus-02
* 16:52 bd808: Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
* 16:50 bd808: Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
* 16:48 bd808: Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)
 
=== 2016-10-25 ===
* 18:48 yuvipanda: repool all depooled instances
* 04:19 yuvipanda: reboot tools-flannel-etcd-01 for https://phabricator.wikimedia.org/T149072#2741012
 
=== 2016-10-24 ===
* 03:45 Krenair: reset host keys for tools-puppetmaster-02 on -01, looks like it was recreated 5-6 days ago
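A minimal sketch of the host-key reset above, assuming the stale entry lives in root's known_hosts on tools-puppetmaster-01:
<pre>
# drop the old key recorded for the recreated instance
ssh-keygen -R tools-puppetmaster-02.tools.eqiad.wmflabs -f /root/.ssh/known_hosts
# reconnect once to verify and record the new key
ssh tools-puppetmaster-02.tools.eqiad.wmflabs true
</pre>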
 
=== 2016-10-20 ===
* 16:55 yuvipanda: killed bzip2 taking 100% CPU on tools-bastion-03
 
=== 2016-10-18 ===
* 22:56 Guest20046: flip tools-k8s-master-01 to tools-puppetmaster-02
* 07:43 yuvipanda: move all tools webgrid nodes to tools-puppetmaster-02 too
* 07:40 yuvipanda: complete moving all general tools exec nodes to tools-puppetmaster-02
* 07:33 yuvipanda: restarted puppetmaster on tools-puppetmaster-01
 
=== 2016-10-17 ===
* 14:37 chasemp: remove bdsync-deb and bdsync-deb-2 erroneously created in Tools and now defunct anyway
* 14:05 chasemp: restart puppetmaster on tools-puppetmaster-01 (instances sticking on puppet runs for a long time)
* 14:01 chasemp: reboot tools-exec-1215 and tools-exec-1410 as unresponsive
 
=== 2016-10-14 ===
* 16:20 yuvipanda: repooled tools-worker-1012, seems to have recovered?!
* 15:57 yuvipanda: drain tools-worker-1012, seems stuck
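The drain/repool dance above is the standard Kubernetes node-maintenance cycle; a minimal sketch, with flags as an assumption about the kubectl version of that era:
<pre>
kubectl cordon tools-worker-1012     # stop new pods landing on the node
kubectl drain tools-worker-1012 --ignore-daemonsets --force   # remove what's running
# ... investigate or reboot the node ...
kubectl uncordon tools-worker-1012   # repool once the node looks healthy
</pre>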
 
=== 2016-10-10 ===
* 18:04 valhallasw`vecto: sudo service bigbrother restart @ tools-services-02
 
=== 2016-10-09 ===
* 18:33 valhallasw`cloud: removed empty local crontabs for {yuvipanda, yuvipanda, tools.toolschecker} on {tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1204, tools-checker-01}. No other local crontabs remaining.
 
=== 2016-10-05 ===
* 12:15 chasemp: reboot tools-webgrid-generic-1404 as locked up
 
=== 2016-10-01 ===
* 10:03 yuvipanda: re-enable puppet on tools-checker-02
 
=== 2016-09-29 ===
* 18:15 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs via wikitech; couldn't ssh in
* 18:10 bd808: Investigating elasticsearch cluster issues affecting stashbot
 
=== 2016-09-27 ===
* 08:07 chasemp: tools-bastion-03:~# chmod 640 /var/log/syslog
 
=== 2016-09-25 ===
* 15:27 Krenair: restarted labs-logbot under tools.morebots
 
=== 2016-09-21 ===
* 18:56 madhuvishy: Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
* 18:42 madhuvishy: Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
* 16:57 chasemp: reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return
 
=== 2016-09-20 ===
* 23:24 yuvipanda: depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order
* 21:23 madhuvishy|food: Pooled new sge exec node  tools-webgrid-lighttpd-1416 (T146212)
* 21:17 madhuvishy|food: Pooled new sge exec node  tools-webgrid-lighttpd-1415 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1418 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1416 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1415 (T146212)
* 17:58 andrewbogott: reboot tools-exec-1410
* 17:54 yuvipanda: repool tools-webgrid-lighttpd-1412
* 17:49 yuvipanda: webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting
* 17:33 yuvipanda: reboot tools-puppetmaster-01
* 17:20 yuvipanda: reboot tools-checker-02
* 15:42 chasemp: move floating ip from tools-checker-02 (failed) to tools-checker-01
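Moving the service IP between checker hosts, as in the last entry above, can be done from the OpenStack CLI; a minimal sketch (the admins may equally have used the wikitech interface; the IP is the tools-checker floating IP noted elsewhere in this log):
<pre>
openstack server remove floating ip tools-checker-02 208.80.155.229
openstack server add floating ip tools-checker-01 208.80.155.229
</pre>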
 
=== 2016-09-13 ===
* 21:09 madhuvishy: Bumped proxy nginx worker_connections limit T143637
* 21:08 madhuvishy: Reenabled puppet across proxy hosts
* 20:44 madhuvishy: Disabling puppet across proxy hosts
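A minimal sketch of the worker_connections bump above, applied by hand on a proxy host while puppet is disabled; the config path and the new value are illustrative assumptions:
<pre>
grep -n worker_connections /etc/nginx/nginx.conf   # check the current limit
sudo sed -i 's/worker_connections [0-9]*;/worker_connections 16384;/' /etc/nginx/nginx.conf
sudo nginx -t && sudo service nginx reload         # validate, then reload without dropping traffic
</pre>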
 
=== 2016-09-12 ===
* 18:33 bd808: Forcing puppet run on tools-cron-01
* 18:31 bd808: Forcing puppet run on tools-bastion-03
* 18:28 bd808: Forcing puppet run on tools-bastion-02
* 18:26 bd808: Forcing puppet run on tools-precise-dev
* 18:26 bd808: Built toollabs-webservice v0.27 package and added to aptly
 
=== 2016-09-10 ===
* 01:06 yuvipanda: migrate tools-k8s-etcd-01 to labvirt1012, is in state doing no io
 
=== 2016-09-09 ===
* 19:27 yuvipanda: reboot tools-exec-1218 and 1219
* 18:10 yuvipanda: killed massive grep running as root
 
=== 2016-09-08 ===
* 21:49 bd808: forcing puppet runs to install toollabs-webservice_0.26_all.deb
* 20:51 bd808: forcing puppet runs to install jobutils_1.15_all.deb
 
=== 2016-09-07 ===
* 21:11 Krenair: brought labs/private.git up to date on tools-puppetmaster-01
* 02:32 Krenair: ran `SULWatcher/restart_SULWatcher.sh` as `tools.stewardbots` on bastion-03 to fix T144887
 
=== 2016-09-06 ===
* 22:14 yuvipanda: got pbuilder off tools-services-01, was taking up too much space.
* 22:10 madhuvishy: Deleted instance tools-web-static-01 and tools-web-static-02 (T143637)
* 21:45 yuvipanda: reboot tools-prometheus-02. nova diagnostics shows no vda activity.
* 20:43 chasemp: drain and reboot tools-exec-1410 for testing
* 07:32 yuvipanda: depooled tools-exec-1219 and 1218, seem to be unresponsive, causing jobs that appear to run but aren't really
 
=== 2016-09-05 ===
* 16:27 andrewbogott: rebooting tools-cron-01 because it is hanging all over the place
 
=== 2016-09-01 ===
* 05:19 yuvipanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck
 
=== 2016-08-31 ===
* 20:48 madhuvishy: Reenabled puppet across tools hosts
* 20:45 madhuvishy: Scratch migration complete on all grid exec nodes (T134896)
* 19:36 madhuvishy: Scratch migration on all non exec/worker nodes complete (T134896)
* 18:18 madhuvishy: Scratch migration complete for all k8s workers (T134896)
* 17:50 madhuvishy: Reenabling puppet across tools hosts.
* 16:55 madhuvishy: Rsync-ed over latest backup of /srv/scratch from labstore1001 to labstore1003
* 16:50 madhuvishy: Puppet disabling complete (T134896)
 
=== 2016-08-30 ===
* 18:54 valhallasw`cloud: edited /etc/shadow on a range of hosts to fix https://phabricator.wikimedia.org/T143191
* 10:59 godog: bounce stashbot, not seen on irc
 
=== 2016-08-29 ===
* 23:38 Krenair: added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
* 16:35 yuvipanda: run chmod u+x /data/project/framabot
* 13:40 chasemp: restart jouncebot
 
=== 2016-08-28 ===
* 05:34 bd808: After git gc on web-static-02.tools:/srv/cdnjs: /dev/mapper/vd-cdnjs--disk  61G  54G  3.3G  95% /srv
* 05:25 bd808: sudo git gc --aggressive on tools-web-static-01.tools:/srv/cdnjs
* 04:56 bd808: sudo git gc --aggressive on tools-web-static-02.tools:/srv/cdnjs
 
=== 2016-08-26 ===
* 16:53 yuvipanda: migrate tools-static-02 to labvirt1001
 
=== 2016-08-25 ===
* 18:07 yuvipanda: restart puppetmaster on tools-puppetmaster-01
* 17:41 yuvipanda: depooled tools-webgrid-1413
* 01:16 yuvipanda: restarted puppetmaster on tools-puppetmaster-01
 
=== 2016-08-24 ===
* 23:03 chasemp: reboot tools-exec-1217
* 17:25 yuvipanda: depool tools-exec-1217, it is dead/stuck/hung/io-starved
 
=== 2016-08-23 ===
* 07:08 madhuvishy: Enabled puppet across tools after merging https://gerrit.wikimedia.org/r/#/c/305657/ (see T134896)
* 05:48 yuvipanda: restarted nginx on tools-proxy-01, was out of connection slots
 
=== 2016-08-22 ===
* 22:07 madhuvishy: Disabled puppet across tools hosts in preparation to merge https://gerrit.wikimedia.org/r/#/c/305657/ (see T134896)
* 22:01 madhuvishy: Disabling puppet across tools hosts
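Fleet-wide puppet disables like the one above are typically done with clush, which this log uses elsewhere; a minimal sketch assuming the clush group named all covers the tools hosts:
<pre>
clush -g all "sudo puppet agent --disable 'merging gerrit 305657 (T134896)'"
# ... merge the change ...
clush -g all "sudo puppet agent --enable && sudo puppet agent -t"
</pre>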
 
=== 2016-08-20 ===
* 11:42 valhallasw`cloud: rebooting tools-mail (hanging)
 
=== 2016-08-19 ===
* 14:52 chasemp: reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6
 
=== 2016-08-18 ===
* 20:00 yuvipanda: restarted maintain-kubeusers on tools-k8s-master-01
 
=== 2016-08-15 ===
* 22:10 yuvipanda: depool tools-exec-1211 and 1205, seem to be out of action
* 19:12 yuvipanda: kill unused tools-merlbot-proxy
 
=== 2016-08-12 ===
* 20:39 yuvipanda: delete tools-webgrid-lighttpd-1415, enough webservices have moved to k8s from that queue
* 20:37 yuvipanda: delete tools-logs-01, going to recreate with a smaller image
* 20:36 yuvipanda: delete tools-webgrid-generic-1405, enough things have moved to k8s from that queue!
* 20:10 yuvipanda: migration of tools-grid-master to labvirt1013 complete
* 20:01 yuvipanda: migrating tools-grid-master (currently inactive) to labvirt1013 away from crowded 1010
* 12:40 chasemp: tools.templatetransclusioncheck@tools-bastion-03:~$ webservice restart
 
=== 2016-08-11 ===
* 20:13 yuvipanda: tools-grid-master finally stopped
* 20:05 yuvipanda: disabled tools-webgrid-lighttpd-1202, is hung
* 17:23 yuvipanda: instance being rebooted is tools-grid-master
* 17:22 chasemp: reboot via nova master as it is stuck
 
=== 2016-08-05 ===
* 19:29 paladox: adding tom29739 to lolrrit-wm project
 
=== 2016-08-04 ===
* 19:09 yuvipanda: cleaned up nginx log files in tools-docker-registry-01 to fix free space warning
* 00:19 yuvipanda: added Krenair as admin to help with T132225 and other issues.
 
=== 2016-08-03 ===
* 22:48 yuvipanda: deleted tools-worker-1005
* 22:08 yuvipanda: depool & delete tools-worker-1007 and 1008
* 21:34 yuvipanda: rebooting tools-puppetmaster-01 to test a hypothesis
* 21:10 yuvipanda: rebooting tools-puppetmaster-01 for kernel upgrade
* 00:20 madhuvishy: Repooled nodes tools-worker 1012 and 1013 for T141126
 
=== 2016-08-02 ===
* 22:49 yuvipanda: depooled tools-worker-1014 as well for T141126
* 22:44 yuvipanda: depool tools-worker-1015 for T141126
* 22:42 paladox: cherry picking 302617 onto lolrrit-wm
* 22:41 madhuvishy: Depooling tools-worker 1012 and 1013 for T141126
* 22:32 yuvipanda: added paladox to tools
* 09:38 godog: bounce morebots production
* 00:01 yuvipanda: depool tools-worker-1017 for T141126
 
=== 2016-08-01 ===
* 23:48 madhuvishy: Repooled tools-worker-1011 and tools-worker-1018 (Yuvi) for T141126
* 23:41 madhuvishy: Repooled tools-worker-1010 and tools-worker-1019 (Yuvi) for T141126
* 23:21 madhuvishy: Yuvi is depooling tools-worker-1018 for T141126
* 23:19 madhuvishy: Depooling tools-worker 1010 and 1011 for T141126
* 23:17 madhuvishy: Yuvi depooled tools-worker-1019 for T141126
* 23:06 madhuvishy: Added tools-worker-1022 as new k8s worker node
* 23:06 madhuvishy: Repooled tools-worker-1009 (T141126)
* 22:48 madhuvishy: Depooling tools-worker-1009 to prepare for T141126
 
=== 2016-07-29 ===
* 22:04 YuviPanda: repooled tools-worker-1006
* 21:48 YuviPanda: deleted tools-worker-1006 after depooling+draining
* 21:45 YuviPanda: repool new tools-worker-1003 with direct-lvm docker storage backend
* 21:30 YuviPanda: depool tools-worker-1003 to be recreated with new docker config, picking this because it's on a non-ssd host
* 21:17 YuviPanda: depooled tools-worker-1020/21 after fixing them up
* 20:41 YuviPanda: delete tools-worker-1001
* 20:29 YuviPanda: depool tools-worker-1001, going to recreate it to test new puppet deploying-first-run
* 20:26 YuviPanda: built new worker nodes tools-worker-1020 and 21 with direct-lvm storage backend
* 17:48 YuviPanda: disable puppet on all tools k8s worker nodes
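The direct-lvm rebuilds above replace Docker's loopback devicemapper storage with a real LVM thin pool; a minimal sketch of what such a configuration looked like in Docker of that era (the thin pool name and options are illustrative assumptions, not the exact ones used):
<pre>
# assumes an LVM thin pool has already been created, e.g. docker/thinpool
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "storage-driver": "devicemapper",
  "storage-opts": [
    "dm.thinpooldev=/dev/mapper/docker-thinpool",
    "dm.use_deferred_removal=true"
  ]
}
EOF
sudo service docker restart
</pre>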
 
=== 2016-07-25 ===
* 14:17 chasemp: nova reboot 64f01f90-c805-4a2e-9ed5-f523b909094e (grid master)
 
=== 2016-07-23 ===
* 23:21 YuviPanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
* 01:56 YuviPanda: deploy kubernetes v1.3.3wmf1
 
=== 2016-07-22 ===
* 17:30 YuviPanda: repool tools-worker-1018
* 14:04 chasemp: reboot tools-worker-1015 as stuck w/ high iowait warning seconds ago.  I cannot ssh in as root.
 
=== 2016-07-21 ===
* 22:42 chasemp: reboot tools-worker-1018 as stuck T141017
 
=== 2016-07-20 ===
* 21:27 andrewbogott: rebooting tools-k8s-etcd-01
* 11:14 Guest9334: rebooted tools-worker-1004
 
=== 2016-07-19 ===
* 01:06 bd808: Upgraded Elasticsearch on tools-elastic-* to 2.3.4
 
=== 2016-07-18 ===
* 21:50 YuviPanda: force downgrade hhvm on tools-webgrid-lighttpd-1408 to fix puppet issues
* 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm on tools-worker-1004
* 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm
* 21:37 YuviPanda: killed tools-pastion-01, no longer in use
* 20:59 bd808: Disabled puppet on tools-elastic-0[123]. Elasticsearch needs to be upgraded.
* 15:15 YuviPanda: kill 8807036 for Luke081515
* 12:48 YuviPanda: reboot tools-flannel-etcd-03 for T140256
* 12:41 YuviPanda: reboot tools-k8s-etcd-02 for T140256
 
=== 2016-07-15 ===
* 10:24 yuvipanda: depool tools-exec-1402 for T138447
* 10:24 yuvipanda: reboot tools-exec-1402 for T138447
* 10:16 yuvipanda: depooling tools-webgrid-lighttpd-1402 and -1412 since they seem to be suffering from T138447
* 10:08 yuvipanda: reboot tools-webgrid-lighttpd-1402 and 1412
 
=== 2016-07-14 ===
* 23:12 bd808: Added Madhuvishy to project "roots" sudoer list
* 22:58 bd808: Added Madhuvishy as projectadmin
* 21:25 chasemp: change perms for tools.readmore to correct bot
 
=== 2016-07-13 ===
* 11:40 yuvipanda: cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation
* 11:19 yuvipanda: drained tools-worker-1004 - high ksoftirqd usage even with no load
* 11:13 yuvipanda: depool tools-worker-1014 - unusable, totally in iowait
* 11:13 yuvipanda: reboot tools-worker-1004, was unresponsive
 
=== 2016-07-12 ===
* 18:07 yuvipanda: reboot tools-worker-1012, it seems to have failed LDAP connectivity :|
 
=== 2016-07-08 ===
* 12:38 yuvipanda: starting up tools-web-static-02 again
 
=== 2016-07-07 ===
* 12:45 yuvipanda: start deployment of k8s 1.3.0wmf4 for T139259
 
=== 2016-07-06 ===
* 13:09 yuvipanda: associated a floating IP with tools-k8s-master-01 for T139461
* 11:47 yuvipanda: moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API
 
=== 2016-07-04 ===
* 11:13 yuvipanda: delete tools-prometheus-01 to free up resources on labvirt1010
* 11:11 yuvipanda: actually deleted instance tools-cron-02 to free up resources on labvirt1010 - was large and not currently used, and failover process takes a while anyway, so we can recreate if needed
* 11:11 yuvipanda: stopped instance tools-cron-02 to free up some resources on labvirt1010
 
=== 2016-07-03 ===
* 17:09 yuvipanda: run qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f to clean out jobs stuck in dr state
* 16:59 yuvipanda: migrate tools-web-static-02 to labvirt1011 to provide more breathing room
* 16:56 yuvipanda: delete temp-test-trusty-package to provide more breathing room on labvirt1010
* 13:49 yuvipanda: reboot tools-exec-1219
* 13:37 yuvipanda: migrating tools-exec-1216 to labvirt1011
* 13:07 yuvipanda: delete tools-bastion-01 which was shut down anyway
* 13:04 yuvipanda: attempt to reboot tools-exec-1212
 
=== 2016-06-28 ===
* 15:25 bd808: Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs
 
=== 2016-06-21 ===
* 16:49 bd808: Updated jobutils to v1.14 for T138178
 
=== 2016-06-17 ===
* 06:17 yuvipanda: forced deletion of 7033590 for dykbot for shubinator
 
=== 2016-06-08 ===
* 20:31 yuvipanda: start tools-bastion-03 was stuck in 'stopped' state
* 20:31 yuvipanda: reboot tools-bastion-03
 
=== 2016-05-31 ===
* 17:35 valhallasw`cloud: re-enabled queues on  tools-exec-1407, tools-exec-1216, tools-exec-1219
* 13:13 chasemp: reboot of tools-exec-1203 see T136495 all jobs seem gone now
 
=== 2016-05-30 ===
* 13:06 valhallasw`cloud: rebooting tools-exec-1221
* 11:53 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/280652 https://gerrit.wikimedia.org/r/#/c/290479 https://gerrit.wikimedia.org/r/#/c/291710/ on tools-puppetmaster-01
 
=== 2016-05-29 ===
* 18:58 YuviPanda: deleted tools-k8s-bastion-01 for T136496
* 14:29 valhallasw`cloud: chowned /data/project/xtools-mab-dev to root and back to stop rogue process that was writing to the directory. I'm still not sure where that process  was running, but at least this seems to have solved the issue
 
=== 2016-05-28 ===
* 21:52 valhallasw`cloud: rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
* 21:21 valhallasw`cloud: rebooting tools-exec-1204 (T136495)
 
=== 2016-05-27 ===
* 14:45 YuviPanda: start moving tools-bastion-03 to use tools-puppetmaster-01 as puppetmaster
 
=== 2016-05-25 ===
* 20:15 YuviPanda: deleted tools-bastion-mtemp per chasemp
* 19:43 YuviPanda: delete devpi instance, not currently in use
* 19:39 YuviPanda: run  sudo dpkg --configure -a on tools-worker-1007 to get it unstuck
* 19:19 YuviPanda: deleted tools-docker-builder-01 and -02, hosed hosts that are unused
* 17:18 YuviPanda: fixed hhvm upgrade on tools-cron-01
* 07:19 YuviPanda: hard reboot tools-services-01, was completely stuck on /public/dumps
* 06:06 bd808: Restarting all webservice jobs
* 05:33 andrewbogott: rebooting tools-proxy-02
 
=== 2016-05-24 ===
* 01:36 scfc_de: tools-cron-02: Downgraded hhvm (sudo apt-get install hhvm).
* 01:36 scfc_de: tools-bastion-03, tools-checker-01, tools-cron-02, tools-exec-1202, tools-proxy-02, tools-redis-1001: Remounted /public/dumps read-only (while sudo umount /public/dumps; do :; done && sudo puppet agent -t).
 
=== 2016-05-23 ===
* 19:36 YuviPanda: switched tools-checker to tools-checker-03
* 16:33 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs
* 13:28 chasemp: 'apt-get install hhvm -y --force-yes' across trusty hosts to handle hhvm downgrade
 
=== 2016-05-20 ===
* 23:39 bd808: Forced puppet run on bastion-02 & bastion-05 to apply fix for T135861
* 19:47 chasemp: tools-exec-1406 having issues rebooting
 
=== 2016-05-19 ===
* 21:07 bd808: deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
* 15:43 YuviPanda: rebooting all tools worker instances
* 13:12 chasemp: reboot tools-exec-1220 stuck in state of unresponsiveness
 
=== 2016-05-13 ===
* 00:40 YuviPanda: cleared all queues that were in error state
 
=== 2016-05-12 ===
* 22:59 YuviPanda: restart tools-worker-1004 to attempt bringing it back up
* 22:59 YuviPanda: deploy k8s 1.2.4wmf1 on all proxy nodes
* 22:58 YuviPanda: deploy k8s on all worker nodes
* 22:46 YuviPanda: deploy k8s master for 1.2.4wmf1
 
=== 2016-05-10 ===
* 04:25 bd808: Added role::package::builder to tools-services-01
 
=== 2016-05-09 ===
* 04:33 YuviPanda: reboot tools-worker-1004, lots of ksoftirqd stuckness despite no actual containers running
 
=== 2016-05-08 ===
* 07:06 YuviPanda: restarted admin tool
 
=== 2016-05-05 ===
* 13:11 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/280652/ on puppetmaster
 
=== 2016-04-28 ===
* 04:15 YuviPanda: delete half of the trusty webservice jobs
* 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up
 
=== 2016-04-24 ===
* 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman
 
=== 2016-04-11 ===
* 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009
 
=== 2016-04-06 ===
* 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01
 
=== 2016-04-05 ===
* 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
* 21:02 bd808: Forcing puppet runs to fix elasticsearch
* 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs
 
=== 2016-04-04 ===
* 19:43 YuviPanda: new bastion!
* 19:15 chasemp: reboot tools-bastion-05
 
=== 2016-03-30 ===
* 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches
 
=== 2016-03-28 ===
* 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
* 20:30 chasemp: change perms on grant files from create-dbusers: chmod 400 and chattr +i
 
=== 2016-03-27 ===
* 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).
 
=== 2016-03-18 ===
* 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
* 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*
 
=== 2016-03-11 ===
* 20:57 mutante: reverted font changes - puppet runs recovering
* 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
* 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
* 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)
 
=== 2016-03-02 ===
* 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth
 
=== 2016-02-29 ===
* 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
* 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
* 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
* 13:50 scfc_de: Deployed jobutils/misctools 1.10.
 
=== 2016-02-28 ===
* 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs
 
=== 2016-02-26 ===
* 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5
 
=== 2016-02-25 ===
* 21:43 scfc_de: Deployed jobutils/misctools 1.9.
 
=== 2016-02-24 ===
* 19:46 chasemp: runonce deployed for https://gerrit.wikimedia.org/r/#/c/272891/
 
=== 2016-02-22 ===
* 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05
 
=== 2016-02-19 ===
* 15:58 chasemp: rerollout tools nfs shaping pilot for sanity in anticipation of formalization
* 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
* 00:50 yuvipanda: failover services to services-02
 
=== 2016-02-18 ===
* 20:37 yuvipanda: failover proxy back to tools-proxy-01
* 19:46 chasemp: repool labvirt1003 and depool labvirt1004
* 18:19 chasemp: draining nodes from labvirt1001
 
=== 2016-02-16 ===
* 21:33 chasemp: reboot of bastion-1002
 
=== 2016-02-12 ===
* 19:56 chasemp: nfs traffic shaping pilot round 2
 
=== 2016-02-05 ===
* 22:01 chasemp: throttle some vm nfs write speeds
* 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
* 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).
 
=== 2016-02-03 ===
* 03:00 YuviPanda: upgraded flannel on all hosts running it
 
=== 2016-01-31 ===
* 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
* 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
* 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
* 13:32 hashar: restarted qamorebot
 
=== 2016-01-30 ===
* 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.
 
=== 2016-01-29 ===
* 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file
 
=== 2016-01-28 ===
* 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D.  *argl*
* 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
* 14:54 scfc_de: tools-cron-01: HUPped 43 processes wikitrends/refresh.sh, though a lot of all processes seem to be stuck in D, so I'll reboot this instance.
* 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.
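A minimal sketch for spotting the stuck processes described above; load far above the CPU count with little actual CPU use usually means many tasks in uninterruptible sleep (state D), typically waiting on NFS:
<pre>
ps -eo state,pid,user,cmd | awk '$1 == "D"'   # list processes stuck in D
ps -eo state | grep -c '^D'                   # count them
uptime                                        # compare with the load average
</pre>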
 
=== 2016-01-27 ===
* 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
* 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
* 19:34 chasemp: master start grid master
* 19:23 chasemp: stopped master
* 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
* 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
* 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate
* 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
* 18:23 valhallasw`cloud: no errors in log file, qstat works
* 18:23 chasemp: master sge restarted post dump and restart for jobs db
* 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
* 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
* 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
* 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
* 18:14 chasemp: grid master stopped
* 00:56 scfc_de: Deployed admin/www bde15df..12a3586.
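The maintenance above dumps and reloads the gridengine jobs database (a Berkeley DB file) with the master stopped; a minimal sketch, where the spool path is an assumption:
<pre>
service gridengine-master stop
cd /var/spool/gridengine/spooldb          # BDB spool directory (assumed)
db_dump -f /root/sge_jobs_dump sge_job    # dump the job database to a flat file
db_load -f /root/sge_jobs_dump sge_job    # reload it after inspection/cleanup
service gridengine-master start
</pre>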
 
=== 2016-01-26 ===
* 21:28 YuviPanda:  qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
* 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs
 
=== 2016-01-25 ===
* 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
* 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
* 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.
 
=== 2016-01-23 ===
* 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.
 
=== 2016-01-21 ===
* 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
* 21:13 YuviPanda: repooled exec nodes on labvirt1010
* 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
* 21:00 YuviPanda: stop gridengine master
* 20:51 YuviPanda: repooled exec nodes on labvirt1007 was last message
* 20:51 YuviPanda: repooled exec nodes on labvirt1006
* 20:39 YuviPanda: failover tools-static to tools-web-static-01
* 20:38 YuviPanda: failover tools-checker to tools-checker-01
* 20:32 YuviPanda: depooled exec nodes on 1007
* 20:32 YuviPanda: repooled exec nodes on 1006
* 20:14 YuviPanda: depooled all exec nodes in labvirt1006
* 20:11 YuviPanda: repooled exec nodes on 1005
* 19:53 YuviPanda: depooled exec nodes on labvirt1005
* 19:49 YuviPanda: repooled exec nodes from labvirt1004
* 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
* 19:31 YuviPanda: depooled exec nodes from labvirt1004
* 19:29 YuviPanda: repooled exec nodes from labvirt1003
* 19:13 YuviPanda: depooled instances on labvirt1003
* 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
* 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
* 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
* 18:38 YuviPanda: restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead
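The depool/repool cycle repeated above maps to a few qmod invocations per node; a minimal sketch for one exec node (the hostname is illustrative):
<pre>
qmod -d "*@tools-exec-1403.eqiad.wmflabs"   # depool: disable all queue instances on the node
qmod -rj JOBID                              # reschedule each restartable job still running there
# ... host maintenance ...
qmod -e "*@tools-exec-1403.eqiad.wmflabs"   # repool: re-enable the queues
</pre>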
 
=== 2016-01-12 ===
* 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).
 
=== 2016-01-11 ===
* 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30
* 22:12 YuviPanda: restarted gridengine master again
* 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
* 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
* 21:57 valhallasw`cloud: reset to 7:30
* 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
* 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
* 21:45 YuviPanda: restarted gridengine master
* 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
* 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
* 21:41 valhallasw`cloud: currently 353 jobs in qw state
* 21:40 valhallasw`cloud: that's load_adjustment_decay_time
* 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
* 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
* 17:51 YuviPanda: kill all queries running on labsdb1003
* 17:20 YuviPanda: stopped webservice for quentinv57-tools
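The maxujobs and load-adjustment experiments above all go through the gridengine scheduler configuration; a minimal sketch of the knobs involved:
<pre>
qconf -ssconf   # show the current scheduler configuration
qconf -msconf   # edit it: maxujobs, job_load_adjustments, load_adjustment_decay_time
</pre>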
 
=== 2016-01-09 ===
* 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229  back to tools-checker-01
* 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
* 13:12 valhallasw`cloud: tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.
 
=== 2016-01-08 ===
* 19:46 chasemp: couldn't get into tools-mail-01 at all and it seemed borked so I rebooted
* 17:23 andrewbogott: killing tools.icelab as per https://wikitech.wikimedia.org/wiki/User_talk:Torin#Running_queries_on_tools-dev_.28tools-bastion-02.29
 
=== 2015-12-30 ===
* 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
* 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
* 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
* 00:40 YuviPanda: restarted master on grid-master
* 00:40 YuviPanda: copied and cleaned out spooldb
* 00:10 YuviPanda: reboot tools-grid-shadow
* 00:08 YuviPanda: attempt to stop shadowd
* 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
* 00:00 YuviPanda: kill -9'd gridengine master
 
=== 2015-12-29 ===
* 23:31 YuviPanda: rebooting tools-grid-master
* 23:22 YuviPanda: restart gridengine-master on tools-grid-master
* 00:18 YuviPanda: shut down redis on tools-redis-01
 
=== 2015-12-28 ===
* 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
* 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
* 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
* 21:27 YuviPanda: created tools-redis-1001
 
=== 2015-12-23 ===
* 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
* 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
* 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
* 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
* 18:40 valhallasw`cloud: scratch that, first going to eat dinner
* 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
* 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
* 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel
 
=== 2015-12-22 ===
* 18:30 YuviPanda: rescheduling all webservices
* 18:17 YuviPanda: failed over active proxy to proxy-01
* 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
* 01:42 YuviPanda: rebooting tools-worker-08
 
=== 2015-12-21 ===
* 18:44 YuviPanda: reboot tools-proxy-01
* 18:31 YuviPanda: failover proxy to tools-proxy-02
 
=== 2015-12-20 ===
* 00:00 YuviPanda: tools-worker-08 stuck again :|
 
=== 2015-12-18 ===
* 15:16 andrewbogott: rebooting locked up host tools-exec-1409
 
=== 2015-12-16 ===
* 23:14 andrewbogott: rebooting  tools-exec-1407, unresponsive
* 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
* 21:28 andrewbogott: deleted tools-docker-registry-01
* 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup
 
=== 2015-12-12 ===
* 10:08 YuviPanda: restarted cron on tools-submit
 
=== 2015-12-10 ===
* 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.
 
=== 2015-12-07 ===
* 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
* 10:46 YuviPanda: restarted nscd on tools-proxy-01
 
=== 2015-12-06 ===
* 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest
 
=== 2015-12-04 ===
* 19:33 Coren: switching master role to tools-grid-master
* 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
* 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01
 
=== 2015-12-02 ===
* 18:29 Coren: switching gridmaster activity to tools-grid-shadow
* 05:13 yuvipanda: increased security groups quota to 50 because why not
 
=== 2015-12-01 ===
* 21:07 yuvipanda: added bd808 as admin
* 21:01 andrewbogott: deleted tool/service group tools.test300
 
=== 2015-11-25 ===
* 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002
 
=== 2015-11-20 ===
* 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
* 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
* 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
* 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
* 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
* 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
* 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
* 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
* 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
* 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
* 19:25 Coren: -lighttpd-1403 wants a restart.
* 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
* 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
* 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
* 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services
 
=== 2015-11-17 ===
* 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
* 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens
 
=== 2015-11-16 ===
* 20:44 PlasmaFury: switch over the proxy to tools-proxy-01
* 17:38 PlasmaFury: deleted tools-webgrid-lighttpd-1412 for https://phabricator.wikimedia.org/T118654
 
=== 2015-11-03 ===
* 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).
 
=== 2015-11-02 ===
* 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
* 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
* 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
* 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
* 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs
 
=== 2015-10-26 ===
* 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts
 
=== 2015-10-11 ===
* 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code
 
=== 2015-10-09 ===
* 22:47 yuvipanda: kill NFS on tools-puppetmaster-01 with https://wikitech.wikimedia.org/wiki/Hiera:Tools/host/tools-puppetmaster-01
* 14:37 Coren: Beginning rotation of execution nodes to apply fix for T106170
 
=== 2015-10-06 ===
* 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare
 
=== 2015-10-02 ===
* 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).
 
=== 2015-10-01 ===
* 23:38 yuvipanda: actually rebooting tools-worker-02, had actually rebooted-01 earlier #facepalm
* 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
* 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
* 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel
 
=== 2015-09-30 ===
* 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
* 06:40 yuvipanda: migrated webproxy to tools-proxy-01
 
=== 2015-09-29 ===
* 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
 
=== 2015-09-28 ===
* 15:24 Coren: rebooting tools-shadow after mount option changes.
 
=== 2015-09-25 ===
* 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
 
=== 2015-09-24 ===
* 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
* 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.
 
=== 2015-09-23 ===
* 18:22 valhallasw`cloud: here = https://etherpad.wikimedia.org/p/74j8K2zIob
* 18:22 valhallasw`cloud: experimenting with https://github.com/jordansissel/fpm on tools-packages, and manually installing packages for that. Noting them here.
 
=== 2015-09-16 ===
* 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
* 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
* 01:17 YuviPanda: attempting to move to kubernetes
 
=== 2015-09-15 ===
* 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.
 
=== 2015-09-14 ===
* 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.
 
=== 2015-09-13 ===
* 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
* 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).
 
=== 2015-09-11 ===
* 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
 
=== 2015-09-08 ===
* 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.<br>Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
* 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
* 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
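A minimal sketch of the aptly import above; it assumes the published distribution names match the repo names, as the quoted aptly output suggests:
<pre>
aptly repo add precise-tools /data/project/.system/deb-precise/*.deb
aptly repo add trusty-tools /data/project/.system/deb-trusty/*.deb
aptly publish update precise-tools
aptly publish update trusty-tools
</pre>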
 
=== 2015-09-07 ===
* 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project    on tools-static-01, which also solved the issue, so skipping the reboot
* 18:47 valhallasw`cloud: switched static webserver to tools-static-02
* 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
* 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master
 
=== 2015-09-03 ===
* 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
* 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
* 07:07 valhallasw`cloud: err, is empty.
* 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!
 
=== 2015-09-02 ===
* 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
* 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
* 13:55 valhallasw`cloud: restarted gridengine_exec on  tools-exec-1403
* 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.
* 13:16 YuviPanda: deleted all jobs of ralgisbot
* 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
* 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles
 
=== 2015-09-01 ===
* 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
* 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
* 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
* 06:23 valhallasw`cloud: seems to have worked. SGE :(
* 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
* 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
* 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
* 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email
 
=== 2015-08-31 ===
* 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
* 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
* 21:20 valhallasw`cloud: restarted webservicemonitor
* 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
* 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
* 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
* 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
* 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
* 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
* 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
* 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
* 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
* 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
* 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
* 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
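A minimal sketch of the throttled reschedule described above, driven by the saved job-id list; the five-second sleep is the rate the log settled on after one per second proved too fast:
<pre>
sort webgrid_jobs | while read jid; do
  qmod -rj "$jid"   # ask gridengine to reschedule the webservice job
  sleep 5           # throttle so the master and NFS keep up
done
</pre>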
 
=== 2015-08-30 ===
* 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
* 13:20 valhallasw`cloud: disabling 503 error page
 
=== 2015-08-29 ===
* 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".
 
=== 2015-08-27 ===
* 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again
 
=== 2015-08-26 ===
* 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start.  If it goes berserk, please service bigbrothermonitor stop.
 
=== 2015-08-25 ===
* 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
* 14:58 YuviPanda: pooled in two new instances for the precise exec pool
* 14:45 YuviPanda: reboot tools-exec-1221
* 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
* 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
* 10:16 YuviPanda: created tools-webgrid-generic-1405
* 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
* 09:59 YuviPanda: created tools-exec-1220 and -1221
 
=== 2015-08-24 ===
* 16:37 valhallasw`cloud: more processes were started, so added a talk page message on [[User:Coet]] (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
* 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
* 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01
 
=== 2015-08-20 ===
* 18:44 valhallasw`cloud: both are now at 3dbbc87
* 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
* 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
* 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
* 17:06 valhallasw`cloud: wait, what timezone is this?!
 
=== 2015-08-19 ===
* 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
 
=== 2015-08-18 ===
* 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
* 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
* 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
* 13:55 valhallasw`cloud: no, wait, that's ''tools-webgrid-lighttpd-1411.eqiad.wmflabs'', not the actual host ''tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs''. We should fix that dns mess as well.
* 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
* 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
* 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using <code>for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done</code>
* 08:37 valhallasw`cloud: sudo service gridengine-exec start on "tools-webgrid-lighttpd-1404.eqiad.wmflabs" "tools-webgrid-lighttpd-1406.eqiad.wmflabs" "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
* 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
* 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
* 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
* 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
* 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
* 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
* 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
* 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
* 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
* 08:00 valhallasw`cloud: running puppet agent -tv again
* 07:55 valhallasw`cloud: argh. Disabling  toollabs::node::web::generic again and enabling  toollabs::node::web::lighttpd
* 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
* 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory  --- ran sudo touch /usr/lib/adminbot/README
* 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
* 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
* 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
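Condensing the debugging session above, registering the new exec host came down to three steps; a minimal sketch using the commands from the log:
<pre>
# add the exec host from its puppet-generated definition file
qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
# add the host to the hostgroup its queues reference
qconf -mhgrp @webgrid
# enable the queue instances on the host
qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
</pre>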
 
=== 2015-08-17 ===
* 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
* 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
* 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
* 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
 
=== 2015-08-15 ===
* 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
* 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...
 
=== 2015-08-14 ===
* 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
* 15:20 andrewbogott: Adding back to the grid engine queue:  tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2015-08-13 ===
* 18:51 valhallasw`cloud: which was resolved by scfc earlier
* 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid <br> Run of Puppet configuration client already in progress; skipping  (/var/lib/puppet/state/agent_catalog_run.lock exists))
* 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
* 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
* 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on  tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204  tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405
 
=== 2015-08-12 ===
* 16:05 andrewbogott: depooling  tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204  tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
* 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
* 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410
 
=== 2015-08-11 ===
* 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow
 
=== 2015-08-04 ===
* 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).
 
=== 2015-08-03 ===
* 19:13 andrewbogott: deleted tools-static-01
 
=== 2015-08-01 ===
* 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
* 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).
 
=== 2015-07-30 ===
* 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
* 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
* 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
* 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
* 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
* 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
* 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
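Both reboots above follow the same drain/reboot cycle; a sketch for one node, assuming gridengine's qmod run from a grid admin host (node name taken from the entry above):
<pre>
NODE=tools-webgrid-lighttpd-1406.eqiad.wmflabs
qmod -d "webgrid-lighttpd@$NODE"         # disable the queue: no new jobs land here
ssh "$NODE" sudo killall -TERM lighttpd  # tools-manifest restarts the webservices elsewhere
ssh "$NODE" sudo reboot
# once the node is back up:
qmod -e "webgrid-lighttpd@$NODE"         # re-enable the queue
</pre>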
 
=== 2015-07-29 ===
* 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
* 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
* 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}
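The rmdir above works because nested brace expansion emits the deepest path first, so a single call can unwind a whole chain of misplaced empty directories; a toy illustration:
<pre>
$ echo /tmp/a{/b{/c,},}
/tmp/a/b/c /tmp/a/b /tmp/a
$ rmdir /tmp/a{/b{/c,},}   # removes c, then b, then a
</pre>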
 
=== 2015-07-28 ===
* 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
* 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
* 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
* 02:07 YuviPanda: removed pacct files from tools-bastion-01
 
=== 2015-07-27 ===
* 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of [[phab:T107052]]: <pre>accton off</pre>
 
=== 2015-07-19 ===
* 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
 
=== 2015-07-11 ===
* 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt
 
=== 2015-07-10 ===
* 23:59 mutante: fixing puppet runs on tools-exec via salt
* 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!
 
=== July 6 ===
* 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)
 
=== July 2 ===
* 17:07 valhallasw`cloud: can't login to tools-mailrelay-01, probably because puppet was disabled for too long. Deleting instance.
* 16:12 valhallasw`cloud: I mean tools-bastion-01
* 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/
 
=== June 29 ===
* 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02
 
=== June 21 ===
* 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
* 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
* 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).
 
=== June 19 ===
* 15:07 YuviPanda: remounting /data/scratch
 
=== June 10 ===
* 11:52 YuviPanda: tools-trusty be gone
 
=== June 8 ===
* 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access
 
=== June 7 ===
* 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS
 
=== June 5 ===
* 17:44 YuviPanda: migrate tools-shadow to labvirt1002
 
=== June 2 ===
* 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
* 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
* 16:20 Coren: switching back to tools-master
* 16:10 YuviPanda: restart nscd on tools-submit
* 15:54 Coren: Switching names for tools-exec-1401
* 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
* 14:34 YuviPanda: turned off dnsmasq for toollabs
* 13:54 Coren: adding new-style names for submit hosts
* 13:53 YuviPanda: moved tools-master / shadow to designate
* 13:52 Coren: new-style names for gridengine admin hosts added
* 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
* 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
* 13:17 Coren: killing the sge_qmaster to test failover
* 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd
 
=== May 29 ===
* 13:39 YuviPanda: tools-redis-01 is redis master now
* 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
* 13:01 YuviPanda: recreating tools-redis-01 and -02
* 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
* 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff
 
=== May 28 ===
* 12:22 wm-bot: petrb: inserted some local IP's to hosts file
* 12:15 wm-bot: petrb: shutting nscd off on tools-master
* 12:14 wm-bot: petrb: test
* 11:28 petan: syslog is full of these: May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
* 11:25 petan: rebooted tools-master in order to try to fix the network issues
 
=== May 27 ===
* 20:10 LostPanda: disabled puppet on tools-shadow too
* 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster  haaail someone?
* 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
* 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS
 
=== May 23 ===
* 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).
 
=== May 22 ===
* 20:37 yuvipanda: deleted and depooled tools-exec-07
 
=== May 20 ===
* 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
* 20:01 yuvipanda: enabling puppet on all hosts
* 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
* 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
* 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
* 19:54 yuvipanda: enabled puppet on tools-precise-dev
* 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/ (rollout pattern sketched at the end of this section)
* 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state
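The 19:33-20:01 entries above follow a canary pattern for risky fleet-wide file changes: freeze puppet everywhere, vet the change on one host, roll it out, unfreeze. A hedged sketch using pdsh/pdcp fan-out (pdsh appears elsewhere in this log; the exact commands used may have differed):
<pre>
pdsh -g tools 'sudo puppet agent --disable'     # freeze the fleet
# canary: install on one host and confirm a puppet run produces no diffs
ssh tools-precise-dev 'sudo cp /tmp/hosts.new /etc/hosts; sudo puppet agent --test'
pdcp -g tools /tmp/hosts.new /tmp/              # stage the vetted file everywhere
pdsh -g tools 'sudo cp /tmp/hosts.new /etc/hosts && sudo puppet agent --enable'
</pre>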
 
=== May 19 ===
* 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
* 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
* 20:12 yuvipanda: force killed croptool webservice
 
=== May 18 ===
* 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
* 01:32 yuvipanda: killed tools-checker-01 instance, recreating
 
=== May 15 ===
* 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
* 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
* 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis
 
=== May 14 ===
* 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
* 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
* 03:29 yuvipanda: drained, depooled and deleted tools-exec-15
 
=== May 10 ===
* 22:08 yuvipanda: created tools-precise-dev instance
* 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (a week)
* 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.
 
=== May 5 ===
* 18:50 Betacommand: helperbot WP:AVI bot running logged out owner is MIA, Coren killed job from 1204 and commented out crontab
 
=== May 4 ===
* 21:24 yuvipanda: reboot tools-submit, was stuck
 
=== May 2 ===
* 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
* 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
* 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
* 08:56 yuvipanda: drained and deleted tools-webgrid-01
* 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
* 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
* 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
* 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
* 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
* 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
* 01:58 yuvipanda: increased tools instance quota
 
=== May 1 ===
* 03:55 YuviKTM: depooled and deleted tools-exec-20
* 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node
 
=== April 30 ===
* 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
* 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
* 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
* 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
* 05:40 YuviKTM: pooled in tools-exec-121{1-9}
* 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
* 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
* 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
* 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
* 05:39 YuviKTM: delete tools-exec-10, was out of jobs
* 04:28 YuviKTM: deleted tools-exec-09
* 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
* 04:23 YuviKTM: repooled tools-exec-1201 is all good now
* 04:19 YuviKTM: rejuggle jobs again in trustyland
* 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
* 04:08 YuviKTM: depooled tools-exec-09, apt troubles
* 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
* 04:00 YuviKTM: pooled tools-exec-1406 and 1407
* 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
* 03:54 YuviKTM: tools-exec-03 and -04 were deleted a long time ago
* 03:53 YuviKTM: depooled tools-exec-03 / 04
* 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
* 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
* 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
* 03:18 YuviKTM: pooled tools-exec-1403, 1404
* 03:13 YuviKTM: pooled tools-exec-1402
* 03:07 YuviKTM: pooled tools-exec-1405
* 03:04 YuviKTM: pooled tools-exec-1401
* 02:53 YuviKTM: created tools-exec-14{06-10}
* 02:14 YuviKTM: created tools-exec-14{01-05}
* 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod
 
=== April 29 ===
* 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: [[Nova_Resource:I-00000bca.eqiad.wmflabs]]
* 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
* 19:28 YuviPanda: recreated tools-static-02
* 19:11 YuviPanda: failed over tools-static to tools-static-01
* 14:47 andrewbogott: deleting tools-exec-04
* 14:44 Coren: -exec-04 drained; removed from queues.  Rest well, old friend.
* 14:41 Coren: disabled -exec-04 (going away)
* 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
* 02:27 YuviPanda: created tools-exec-12{01-10}
 
=== April 28 ===
* 21:41 andrewbogott: shrinking tools-master
* 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
* 21:32 andrewbogott: shrinking tools-redis
* 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
* 21:27 andrewbogott: shrinking tools-submit
* 21:21 YuviPanda: backup crontabs onto NFS
* 21:18 andrewbogott: shrinking  tools-webproxy-02
* 21:14 andrewbogott: shrinking  tools-static-01
* 21:11 andrewbogott: shrinking tools-exec-gift
* 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
* 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
* 21:01 YuviPanda: failover tools-static to tools-static-02
* 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
* 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
* 20:39 valhallasw`cloud: created tools-mailrelay-01 [[Nova_Resource:I-00000bac.eqiad.wmflabs]]
* 20:26 YuviPanda: failed over tools-services to services-01
* 18:11 Coren: reenabled -webgrid-generic-02
* 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
* 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
* 14:04 Coren: reenable -exec-11 for jobs.
* 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment
 
=== April 25 ===
* 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
* 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough
 
=== April 24 ===
* 16:29 Coren: repooled -exec-02, -08, -12
* 16:05 Coren: -exec-02, -08 and -12 draining
* 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
* 15:41 Coren: -exec-03 goes away for good.
* 15:31 Coren: draining -exec-03 to ease migration
* 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot
 
=== April 23 ===
* 22:41 YuviPanda: disabled *@tools-exec-09
* 22:40 YuviPanda: add tools-exec-09 back to @general
* 22:38 YuviPanda: take tools-exec-09 from @general group
* 20:53 YuviPanda: restart bigbrother
* 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
* 20:22 valhallasw`cloud: removed <code>10.68.16.4 tools-webproxy tools.wmflabs.org</code> from /etc/hosts
* 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
* 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs
 
=== April 20 ===
* 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.
 
=== April 18 ===
* 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again (see the sketch at the end of this section)
* 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
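Background for the overcommit tweak above: BGSAVE makes Redis fork, and with vm.overcommit_memory=0 the kernel can refuse the fork once Redis holds more than half of RAM, even though copy-on-write means the child rarely needs that much. A sketch of applying and persisting the setting (the sysctl.d file name is an assumption):
<pre>
sudo sysctl vm.overcommit_memory=1                   # allow the BGSAVE fork now
echo 'vm.overcommit_memory = 1' |
    sudo tee /etc/sysctl.d/60-redis-overcommit.conf  # survive the next reboot
</pre>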
 
=== April 17 ===
* 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02
 
=== April 16 ===
* 20:57 Coren: -webgrid-08 drained, rebooting
* 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
* 20:45 Coren: -webgrid-03 drained, rebooting
* 20:38 Coren: -webgrid-03 depooled
* 20:38 Coren: -webgrid-02 repooled
* 20:35 Coren: -webgrid-02 drained, rebooting
* 20:33 Coren: -webgrid-02 depooled
* 20:32 Coren: -webgrid-01 repooled
* 20:06 Coren: -webgrid-01 drained, rebooting.
* 19:56 Coren: depooling -webgrid-01 for reboot
* 14:37 Coren: rebooting -master
* 14:29 Coren: rebooting -mail
* 14:22 Coren: rebooting -shadow
* 14:22 Coren: -exec-15 repooled
* 14:19 Coren: -exec-15 drained, rebooting.
* 13:46 Coren: -exec-14 repooled.  That's it for general exec nodes.
* 13:44 Coren: -exec-14 drained, rebooting.
 
=== April 15 ===
* 21:06 Coren: -exec-10 repooled
* 20:55 Coren: -exec-10 drained, rebooting
* 20:49 Coren: -exec-07 repooled.
* 20:47 Coren: -exec-07 drained, rebooting
* 20:43 Coren: -exec-06 requeued
* 20:41 Coren: -exec-06 drained, rebooting
* 20:15 Coren: repool -exec-05
* 20:10 Coren: -exec-05 drained, rebooting.
* 19:56 Coren: -exec-04 repooled
* 19:52 Coren: -exec-04 drained, rebooting.
* 19:41 Coren: disabling new jobs on remaining (exec) precise instances
* 19:32 Coren: repool -exec-02
* 19:30 Coren: draining -exec-04
* 19:29 Coren: -exec-02 drained, rebooting
* 19:28 Coren: -exec-03 rebooted, requeuing
* 19:26 Coren: -exec-03 drained, rebooting
* 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
* 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
* 18:40 Coren: tools-exec-01 drained of jobs; rebooting
* 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
* 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.
 
=== April 14 ===
* 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
* 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== April 13 ===
* 21:11 YuviPanda: restart portgranter on all webgrid nodes
 
=== April 12 ===
* 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 11 ===
* 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
* 02:15 YuviPanda: rebooted tools-submit, was not responding
 
=== April 10 ===
* 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
* 05:20 YuviPanda: delete the tomcat node finally :D
 
=== April 9 ===
* 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
* 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
* 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== April 8 ===
* 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
* 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
* 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.
 
=== April 7 ===
* 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== April 5 ===
* 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 4 ===
* 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
* 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
* 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
* 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).
 
=== April 3 ===
* 22:55 scfc_de: Removed empty cgi-bin directories.
* 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 2 ===
* 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
* 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
* 01:25 YuviPanda: created tools-bastion-02
 
=== April 1 ===
* 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).
 
=== March 31 ===
* 14:02 Coren: rebooting tools-submit
* 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
* 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
* 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources.  It can be restarted any time.
 
=== March 30 ===
* 22:53 Coren: resyncing project storage with rsync
* 22:40 Coren: reboot tools-login
* 22:30 Coren: also bastion2
* 22:28 Coren: reboot bastion1 so users can log in
* 21:49 Coren: rebooting dedicated exec nodes.
* 21:49 Coren: rebooting tools-submit
* 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== March 29 ===
* 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.
 
=== March 28 ===
* 19:42 YuviPanda: created tools-exec-20
 
=== March 26 ===
* 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 25 ===
* 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== March 24 ===
* 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
* 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 23 ===
* 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
* 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
* 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
* 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up
 
=== March 22 ===
* 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
* 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
* 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01
 
=== March 21 ===
* 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== March 15 ===
* 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 13 ===
* 16:23 YuviPanda: cleaned out / on tools-trusty
 
=== March 11 ===
* 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
* 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
* 03:56 YuviPanda: restarted redis server, it had OOM-killed
 
=== March 9 ===
* 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
* 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
* 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
* 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).
 
=== March 7 ===
* 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.
 
=== March 6 ===
* 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
* 07:43 scfc_de: Deployed jobutils/misctools 1.4.
 
=== March 2 ===
* 09:53 YuviPanda: added ananthrk to project
* 08:41 YuviPanda: delete tools-uwsgi-01
* 08:11 YuviPanda: delete tools-uwsgi-02 because https://phabricator.wikimedia.org/T91065
 
=== March 1 ===
* 15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure
 
=== February 28 ===
* 07:51 YuviPanda: create tools-webgrid-07
* 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
* 01:00 Coren: Also That was -webgrid-05
* 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
 
=== February 27 ===
* 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
* 15:33 Coren: Switched back to -master.  I'm making a note here: great success.
* 15:27 Coren: Gridengine master failover test part three; killing the master with -9
* 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
* 15:10 YuviPanda: created tools-webgrid-generic-02
* 15:10 YuviPanda: increase instance quota to 64
* 15:10 Coren: Master restarted - test not successful.
* 14:50 Coren: testing gridengine master failover starting now
* 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well
 
=== February 24 ===
* 18:33 Coren: tools-submit not recovering well from outage, kicking it.
* 17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs
 
=== February 16 ===
* 02:31 scfc_de: rm -f /var/log/exim4/paniclog.
 
=== February 13 ===
* 18:01 Coren: tools-redis is dead, long live tools-redis
* 17:48 Coren: rebuilding tools-redis with moar ramz
* 17:38 legoktm: redis on tools-redis is OOMing?
* 17:26 marktraceur: restarting grrrit-wm because it's not behaving
 
=== February 1 ===
* 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards. (Hypothetical sketch at the end of this section.)
* 07:51 YuviPanda: cleared error state of stuck queues
* 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
* 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
* 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
* 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
* 04:10 YuviPanda: widar moved to trusty
* 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.
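The dummy-job trick above works because bigbrother only watches accounts it has seen on the grid; a hypothetical sketch using the jsub wrapper from jobutils (the actual invocation is not recorded):
<pre>
for tool in ftl limesmap newwebtest osm-add-tags render tsreports typoscan usersearch; do
    sudo -iu "tools.$tool" jsub -once /bin/true   # one throwaway job per tool account
done
# then delete the *.out / *.err files the jobs leave in each tool's home
</pre>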
 
=== January 29 ===
* 17:26 YuviPanda: reschedule all tomcat jobs
 
=== January 27 ===
* 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo
 
=== January 19 ===
* 20:51 YuviPanda: because valhallasw is nice
* 10:34 YuviPanda: manually started tools-webgrid-generic-01
* 09:48 YuviPanda: restarted tools-webgrid-03
* 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
* 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).
 
=== January 16 ===
* 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
 
=== January 15 ===
* 22:10 YuviPanda: created instance tools-webgrid-generic-01
 
=== January 11 ===
* 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
 
=== January 8 ===
* 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G
 
=== December 23 ===
* 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000
 
=== December 22 ===
* 07:43 YuviPanda: increased RAM and Cores quota for tools
 
=== December 19 ===
* 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is hand-hacked to remove stupid syntax errors that got merged.
* 12:00 YuviPanda|brb: created tools-static, static http server
* 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== December 17 ===
* 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get
 
=== December 12 ===
* 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.
 
=== December 11 ===
* 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.
 
=== December 8 ===
* 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy; otherwise puppet fails because ec2id thinks we're not in labs, since hostname -d is empty when /etc/hosts resolves the IP directly to tools-webproxy
 
=== December 7 ===
* 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
* 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
 
=== December 2 ===
* 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
* 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 26 ===
* 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty
 
=== November 25 ===
* 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 24 ===
* 14:02 YuviPanda: rebooting tools-login, OOM'd
* 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 22 ===
* 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 17 ===
* 20:40 YuviPanda: cleaned out /tmp on tools-login
 
=== November 16 ===
* 21:31 matanya: back to normal
* 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"
 
=== November 15 ===
* 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda
 
=== November 14 ===
* 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
* 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
* 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).
 
=== November 13 ===
* 21:11 YuviPanda: disable puppet on tools-dev to check shinken
* 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
* 20:38 YuviPanda: didn't actually stop puppet, need more patches
* 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
* 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
* 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.
 
=== November 12 ===
* 22:07 StupidPanda: enabled puppet on tools-exec-07
* 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
* 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
* 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken
 
=== November 7 ===
* 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).
 
=== November 6 ===
* 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).
 
=== November 5 ===
* 19:15 mutante: exec nodes have p7zip-full now
* 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login
 
=== November 4 ===
* 19:50 mutante: - apt-get clean on tools-login, and gzipped some logs
 
=== November 1 ===
* 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
 
=== October 30 ===
* 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
* 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp
 
=== October 27 ===
* 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
* 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.
 
=== October 26 ===
* 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
* 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
* 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.
 
=== October 24 ===
* 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006
 
=== October 23 ===
* 22:55 Coren: reboot tools-shadow, upstart seems hosed
 
=== October 14 ===
* 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07
 
=== October 11 ===
* 15:31 andrewbogott: rebooting tools-master, stab in the dark
* 06:01 YuviPanda: restarted gridengine-master on tools-master
 
=== October 4 ===
* 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b
 
=== October 2 ===
* 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
* 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools
 
=== September 28 ===
* 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3
 
=== September 25 ===
* 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
 
=== September 17 ===
* 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap
 
=== September 15 ===
* 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work
 
=== September 13 ===
* 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy
 
=== September 12 ===
* 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase
 
=== September 8 ===
* 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)
 
=== September 5 ===
* 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
* 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
* 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"
 
=== September 4 ===
* 19:47 lokal-profil: local-heritage: Renamed two Swedish tables
 
=== September 2 ===
* 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076
 
=== August 23 ===
* 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11  : before job")
 
=== August 21 ===
* 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils
 
=== August 15 ===
* 16:45 legoktm: fixed grrrit-wm
* 16:36 legoktm: restarting grrrit-wm
 
=== August 14 ===
* 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
 
=== August 12 ===
* 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again
 
=== August 2 ===
* 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
* 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs
 
=== August 1 ===
* 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")
 
=== July 24 ===
* 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
* 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts
 
=== July 21 ===
* 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
* 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
* 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again
 
=== July 18 ===
* 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
* 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
* 12:58 scfc_de: Made tools-webgrid-03 a grid submit host
 
=== July 16 ===
* 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
* 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
* 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st
 
=== July 15 ===
* 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04
 
=== July 14 ===
* 20:40 andrewbogott: on tools-login
* 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update
 
=== July 13 ===
* 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
* 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore
 
=== July 12 ===
* 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
* 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
* 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation of reboot
* 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation of reboot
* 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug why the directory wasn't created later
 
=== July 11 ===
* 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts
 
=== July 10 ===
* 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
* 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
* 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup
 
=== July 9 ===
* 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
* 23:09 YuviPanda: created tools-exec-13 with precise
* 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
* 23:07 YuviPanda: created tools-exec-12
* 23:06 YuviPanda: created tools-exec-11
* 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
* 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
* 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
* 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
* 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes (diagnosis sketch at the end of this section)
* 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
* 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
* 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
* 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08
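For the cyberbot inode exhaustion above: df -i confirms the condition, and a name/age-scoped find sizes the problem before anything is deleted. A sketch reusing the predicate from the first step actually logged:
<pre>
df -i /   # IUse% at 100% means no inodes left even if df -h shows free space
sudo find /tmp -maxdepth 1 -type f -name '*cyberbotpeachy.cookies*' \
     -mtime +30 -print | wc -l   # count candidates before rerunning with -delete
</pre>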
 
=== July 8 ===
* 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
* 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
* 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
* 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
* 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
* 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
* 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
* 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")
 
=== July 6 ===
* 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged
 
=== July 5 ===
* 22:36 YuviPanda: cleared diamond archive logs on a bunch of machines, submitted patch to get rid of archive logs
* 22:17 YuviPanda: changed grid scheduling config, set weight_priority to 0.1 from 0.0 for https://bugzilla.wikimedia.org/show_bug.cgi?id=67555
 
=== July 4 ===
* 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
* 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond
 
=== July 3 ===
* 16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
* 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
* 14:37 Betacommand: replication for enwiki is halted current lag is at 9876
 
=== July 2 ===
* 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
* 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats
 
=== July 1 ===
* 23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
* 21:08 scfc_de: Reset queues in error state again
* 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
* 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
* 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
* 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
* 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
* 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
* 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
* 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
* 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*
 
=== June 30 ===
* 22:10 YuviPanda: ran webservice start for enwp10
* 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
* 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
* 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
* 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
* 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue
 
=== June 29 ===
* 19:45 scfc_de: magnustools: "webservice start"
* 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead
 
=== June 28 ===
* 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy
 
=== June 21 ===
* 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki
 
=== June 20 ===
* 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
* 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219
 
=== June 16 ===
* 23:50 scfc_de: Shut down diamond services and removed log files on all hosts
 
=== June 15 ===
* 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non preallocating version is 'not meant for production', so putting on hold for now
* 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
* 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
* 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)
 
=== June 13 ===
* 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. straange
 
=== June 10 ===
* 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
* 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var
 
=== June 3 ===
* 17:50 Betacommand: Brief network outage. source:  It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.
 
=== June 2 ===
* 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
* 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though.  iipsrv.fcgi however has TMPDIR set as planned.
 
=== May 27 ===
* 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
* 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
* 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
* 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log
 
=== May 25 ===
* 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors
 
=== May 23 ===
* 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
* 14:10 andrewbogott: applying  role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors
 
=== May 22 ===
* 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
* 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
* 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
* 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html" (freeze sketch below)
* 01:12 scfc_de: tools-mail: /var is full
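A sketch of the Yahoo! freeze above, assuming the exiqgrep helper shipped with exim4 (-i prints only message IDs, -r matches recipient addresses):
<pre>
exiqgrep -i -r 'yahoo\.com' | xargs -r exim -Mf   # freeze matching queued messages
exim -bp | exiqsumm                               # then review the queue summary
</pre>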
 
=== May 20 ===
* 18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues
 
=== May 16 ===
* 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)
 
=== May 14 ===
* 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
* 00:23 Betacommand: 503s related to bug 65179
 
=== May 13 ===
* 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
* 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503s
* 19:09 marktraceur: Restarted grrrit because it had a stupid nick
 
=== May 10 ===
* 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1
 
=== May 9 ===
* 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes
 
=== May 6 ===
* 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
* 17:53 scfc_de: Don't think replagstats is really working ...
* 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).
 
=== April 28 ===
* 11:51 YuviPanda: pywikibugs Deployed {{Gerrit|bf1be7b55a19457469f311ae54e1cf6409eb4a0b}}
 
=== April 27 ===
* 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1
 
=== April 24 ===
* 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
* 23:38 YuviPanda: restarted greg-g after cherry-picking {{Gerrit|aec09a6f669bc1806557576212aa218bfa520c35}} for auth of IRC bot
* 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
* 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)
 
=== April 20 ===
* 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
* 14:08 scfc_de: tools-redis: /var is full
* 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
* 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
* 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)
 
=== April 13 ===
* 14:20 scfc_de: Restarted webservice for wikihistory to see if the change to PHP_FCGI_MAX_REQUESTS increases reliability
* 14:17 scfc_de: tools-webgrid-01, tools-webgrid-02: Set PHP_FCGI_MAX_REQUESTS to 500 in /usr/local/bin/lighttpd-starter per http://redmine.lighttpd.net/projects/1/wiki/docs_performancefastcgi#Why-is-my-PHP-application-returning-an-error-500-from-time-to-time
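PHP_FCGI_MAX_REQUESTS makes each php-cgi child exit after that many requests, so leaky or wedged children get recycled instead of lingering. A sketch of the kind of wrapper change described above (the real file's contents are an assumption):
<pre>
#!/bin/bash
# Sketch of /usr/local/bin/lighttpd-starter, not the actual wrapper.
export PHP_FCGI_MAX_REQUESTS=500   # recycle each php-cgi child after 500 requests
exec /usr/sbin/lighttpd "$@"
</pre>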
 
=== April 12 ===
* 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")
 
=== April 11 ===
* 16:21 scfc_de: tools-login: Killed -HUP process consuming 2.6 GByte; cf. [[wikitech:User talk:Ralgis#Welcome to Tool Labs]]
 
=== April 10 ===
* 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes
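A sketch of the orphan hunt above, approximating "not a (grand-)child of lighttpd" as "reparented to init", which is where php-cgi processes end up once their lighttpd ancestor has died:
<pre>
pgrep -x php-cgi | while read -r pid; do
    ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
    [ "$ppid" = 1 ] && sudo kill -HUP "$pid"   # orphan: its lighttpd tree is gone
done
</pre>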
 
=== April 8 ===
* 05:06 Ryan_Lane: restart nginx on tools-proxy-test
* 05:03 Ryan_Lane: upgraded libssl on all nodes
 
=== April 4 ===
* 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
* 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request
 
=== March 30 ===
* 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
* 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier
 
=== March 29 ===
* 10:14 wm-bot: petrb: disabled 1 job in cron in -login of user tools.tools-info which was killing login server
 
=== March 28 ===
* 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
* 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
* 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login
 
=== March 21 ===
* 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
* 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mär 17 16:42", perhaps artifact of mass migration?)
* 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
* 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
* 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8
 
=== March 20 ===
* 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
 
=== March 5 ===
* 13:57 wm-bot: petrb: test
 
=== March 4 ===
* 22:35 wm-bot: petrb: uninstalling it from -login too
* 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there
 
=== March 3 ===
* 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make system useable and finish upgrade
* 19:17 wm-bot: petrb: upgrading all packages on webserver-02
* 19:15 petan: rebooting webserver-01 which is totally dead
* 19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
* 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
* 16:44 scfc_de: tools-webserver-03: Apache was swamped by requests for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
* 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
* 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
* 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.
 
=== March 1 ===
* 03:42 Coren: disabled puppet in pmtpa tool labs
 
=== February 28 ===
* 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
* 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"
 
=== February 27 ===
* 15:28 scfc_de: chmod g-w ~fsainsbu/.forward
 
=== February 25 ===
* 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.
 
=== February 23 ===
* 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC
 
=== February 21 ===
* 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
* 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either
 
=== February 20 ===
* 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at [[User talk:Reza#Running bots on tools-login, etc.]] ([[:fa:بحث_کاربر:Reza1615]] is write-protected)
* 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at [[:ko:사용자토론:ChongDae#Running bots on tools-login, etc.]]
* 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
* 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
* 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
 
=== February 19 ===
* 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
* 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127
 
=== February 18 ===
* 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
* 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
* 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
* 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
 
=== February 14 ===
* 23:54 legoktm: restarting grrrit-wm since it disappeared
* 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
 
=== February 13 ===
* 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
* 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
* 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help
 
=== February 12 ===
* 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc
 
=== February 11 ===
* 15:47 scfc_de: Restarted webservice for geohack
* 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
* 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05
 
=== February 10 ===
* 23:16 Coren: rebooting webproxy (braindead autofs)
 
=== February 9 ===
* 18:14 legoktm: restarting grrrit-wm, it keeps joining and quitting
* 04:27 legoktm: rebooting grrrit-wm - https://gerrit.wikimedia.org/r/#/c/112308
 
=== February 6 ===
* 22:50 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/111889
 
=== February 4 ===
* 20:38 legoktm: restarting grrrit-wm: 'Send mediawiki/extension/Thanks to -corefeatures' https://gerrit.wikimedia.org/r/111257
 
=== January 31 ===
* 03:43 scfc_de: Cleaned up all exim queues
* 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls)
 
=== January 30 ===
* 21:48 scfc_de: chmod g-w ~fluff/.forward
* 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw; sketched below)
* 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
* 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")
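The reroute parenthetical above expands to a per-message sequence; a sketch for a single message, with a placeholder ID and placeholder addresses (the real ones are not logged):
<pre>
ID=1VxYzA-0001Ab-CD                                      # placeholder message ID
exim -Mf  "$ID"                                          # freeze
exim -Mar "$ID" tools.betabot@tools.wmflabs.org          # add the corrected recipient
exim -Mmd "$ID" tools.betabot@tools-login.pmtpa.wmflabs  # mark the stale one delivered
exim -Mt  "$ID"                                          # thaw; next queue run delivers
</pre>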
 
=== January 28 ===
* 04:27 scfc_de: tools-webproxy: Blocked Phonifier
 
=== January 25 ===
* 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)
 
=== January 24 ===
* 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
* 00:11 scfc_de: tools-db: and restarted mysqld
* 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/
 
=== January 23 ===
* 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
* 19:23 legoktm: ^ was for grrrit-wm
* 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already
 
=== January 21 ===
* 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf
 
=== January 20 ===
* 07:02 andrewbogott: merged a lint patch to the gridengine module.  Should be a noop
 
=== January 16 ===
* 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot
 
=== January 15 ===
* 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19  : before writing exit_status")
* 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
* 08:54 andrewbogott: rebooted tools-exec-09
* 08:32 andrewbogott: rebooted tools-db
 
=== January 14 ===
* 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
* 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
* 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)
 
=== January 10 ===
* 10:41 legoktm: grrrit-wm: restarting https://gerrit.wikimedia.org/r/106670
* 09:00 legoktm: grrrit-wm: setting up #mediawiki-feed, https://gerrit.wikimedia.org/r/106555
 
=== January 9 ===
* 18:26 legoktm: rebased grrrit-wm on origin/master since fetching gerrit was failing
* 18:21 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/#/c/106501/
 
=== January 8 ===
* 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09
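On gridengine, failed jobs can leave whole queue instances in an error state that blocks scheduling until an admin clears it; a sketch with standard SGE commands (queue names from the entry):
<pre>
# Show queue instances in an error state and the reason
qstat -f -explain E
# Clear the error state on the affected queue instances
sudo qmod -c 'continuous@tools-exec-05,task@tools-exec-05,task@tools-exec-09'
</pre>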
 
=== January 7 ===
* 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)
 
=== January 6 ===
* 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead
 
=== January 1 ===
* 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
* 11:27 scfc_de: tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
* 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available
 
=== December 27 ===
* 07:39 legoktm: grrrit-wm restart didn't really work.
* 07:38 legoktm: restarting grrrit-wm, for some reason it reconnected and lost its cloak
 
=== December 23 ===
* 18:30 marktraceur: restart grrrit-wm for subbu
 
=== December 21 ===
* 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
 
=== December 19 ===
* 17:22 marktraceur: deploying grrrit config change
 
=== December 17 ===
* 23:19 legoktm: rebooted grrrit-wm with new config stuffs
 
=== December 14 ===
* 18:13 marktraceur: restarting grrrit-wm to fix its nickname
* 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
* 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)
 
=== December 4 ===
* 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
* 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS
 
=== December 1 ===
* 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.
 
=== November 25 ===
* 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
* 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)
 
=== November 24 ===
* 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
* 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion
 
=== November 14 ===
* 09:12 ori-l: Added aude to lolrrit-wm maintainers group
 
=== November 13 ===
* 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year - which is before that instance even existed, so what the heck?
 
=== November 3 ===
* 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.
 
=== November 1 ===
* 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by a heavy mysql client run by elbransco
 
=== October 23 ===
* 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)
 
=== October 20 ===
* 18:52 wm-bot: petrb: everything looks better
* 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
* 18:49 wm-bot: petrb: installed links on -dev and going to investigate what is wrong with the apaches; Coren, please update the documentation
 
=== October 15 ===
* 21:03 Coren: labs-login rebooted to fix the ownership/take issue, with success.
 
=== October 10 ===
* 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again
 
=== September 23 ===
* 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: {{Gerrit|54444}})
* 05:15 legoktm: logging a log to test the log logging
* 05:13 legoktm: logging a log to test the log logging
 
=== September 11 ===
* 09:39 wm-bot: petrb: started toolwatcher
 
=== August 24 ===
* 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
* 17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot
 
=== August 23 ===
* 12:17 wm-bot: petrb: test
* 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
* 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
* 11:48 petan: someone installed firefox on exec nodes, should investigate / remove
 
=== August 22 ===
* 01:24 scfc_de: tools-webserver-03: Installed python-oursql
 
=== August 20 ===
* 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments
 
=== August 19 ===
* 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator
 
=== August 16 ===
* 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeated reboots
 
=== August 15 ===
* 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
* 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
* 14:24 scfc_de: chmod 644 ~magnus/.forward
* 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
* 02:02 scfc_de: robots.txt: "Disallow: /"
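The 02:02 entry gives the robots.txt directive; a minimal sketch of putting it in place (the webroot path and the <code>User-agent</code> preamble are assumptions):
<pre>
# Webroot path is an assumption; the Disallow directive is from the entry
cat <<'EOF' | sudo tee /var/www/robots.txt
User-agent: *
Disallow: /
EOF
</pre>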
 
=== August 11 ===
* 03:14 scfc_de: tools-mc: Purged memcached
 
=== August 10 ===
* 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
* 02:24 scfc_de: chmod g-w ~whym/.forward
 
=== August 6 ===
* 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
* 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab
 
=== August 5 ===
* 20:32 scfc_de: chmod g-w ~ladsgroup/.forward
 
=== August 2 ===
* 23:45 scfc_de: tools-dev: Installed dialog for testing
 
=== August 1 ===
* 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
* 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables
 
=== July 31 ===
* 10:50 HenriqueCrang: ptwikis added graph with mobile edits
 
=== July 30 ===
* 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
* 07:32 wm-bot: petrb: deleted local-addbot jobs
* 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; the local copies were obsolete versions.
 
=== July 29 ===
* 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
* 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
* 02:40 scfc_de: Restarted toolwatcher on tools-login.
* 02:11 scfc_de: Reboot tools-login, was not responsive
 
=== July 25 ===
* 23:37 Ryan_Lane: added myself to lolrrit-wm tool
* 12:06 wm-bot: petrb: test
* 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing
 
=== July 20 ===
* 15:19 petan: rebooting tools-redis
 
=== July 19 ===
* 07:06 petan: instances were rebooted for unknown reasons
* 00:42 helderwiki: it works! :-)
* 00:41 legoktm: test
 
=== July 10 ===
* 18:04 wm-bot: petrb: installing mysqltcl on grid
* 18:01 wm-bot: petrb: installing tclodbc on grid
 
=== July 5 ===
* 19:38 AzaToth: test
* 19:36 AzaToth: test for example
* 18:23 Coren: brief outage of webproxy complete (back to business!)
* 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)
 
=== July 3 ===
* 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
* 12:58 petan: rebooting -mc, it's apparently dying of OOM
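A sketch of the client-side change from the 13:44 entry (both options are standard OpenSSH ssh_config settings; the matching sshd_config change on the target hosts is not shown):
<pre>
# Append the host-based auth settings to the system-wide client config
sudo tee -a /etc/ssh/ssh_config <<'EOF'
HostbasedAuthentication yes
EnableSSHKeysign yes
EOF
</pre>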
 
=== July 2 ===
* 16:24 wm-bot: petrb: installed maria to all nodes so we can connect to db even from sge
* 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl
 
=== July 1 ===
* 18:39 wm-bot: petrb: started toolwatcher on -login
* 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
* 12:05 wm-bot: petrb: starting toolwatcher
* 11:40 wm-bot: petrb: tools is back o/
* 09:42 wm-bot: petrb: installing python-zmq and python-matplotlib on -dev
* 03:33 scfc_de: Rebooted tools-login apparently out of memory and not responding to ssh
 
=== June 30 ===
* 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
* 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda
 
=== June 26 ===
* 21:16 Coren: +Tim Landscheidt to project admins, local-admin
* 14:23 wm-bot: petrb: updating several packages on -login
* 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ?        00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
* 13:42 wm-bot: petrb: restarting redis
* 13:28 wm-bot: petrb: running puppet on -mc
* 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
* 09:35 wm-bot: petrb: updated status.php to a version which displays free vmem as well
 
=== June 25 ===
* 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web
 
=== June 24 ===
* 15:45 wm-bot: petrb: changed colors of root prompt: production vs testing
* 07:57 wm-bot: petrb: 50527    4186 22830  1 Jun23 pts/41  00:08:54 python fill2.py eats 48% of ram on -login
 
=== June 19 ===
* 12:17 wm-bot: petrb: increasing limit on mysql connections
 
=== June 17 ===
* 17:34 wm-bot: petrb: /var/spool/cron/crontabs/local-voxelbot has "-rw------- 1 8006 crontab 1176 Apr 11 14:07" (numeric owner); fixing
 
=== June 16 ===
* 21:23 Coren: 1.0.3 deployed (jobutils, misctools)
 
=== June 15 ===
* 21:40 wm-bot: petrb: there is no lvm on -db, which we badly need - therefore no swap either, nor storage for binary logs :( I've got a feeling that mysql will die OOM soonish
* 21:39 wm-bot: petrb: db has 5% free RAM eeeek
* 18:36 wm-bot: root: removed a lot of audit(?) logs from exec-04; they were eating too much storage
* 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
* 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
* 12:33 wm-bot: petrb: installing redis on tools-mc
 
=== June 14 ===
* 12:35 wm-bot: petrb: updating logsplitter to new version
 
=== June 13 ===
* 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful c++ version, saving a lot of resources on both servers
* 12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1 GB of RAM); it may need to be fixed or moved to a separate webserver. Adding swap to prevent the machine dying OOM
* 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
* 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
* 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32  1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources
 
=== June 11 ===
* 07:36 wm-bot: petrb: installed libdigest-crc-perl
 
=== June 10 ===
* 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
* 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
* 08:25 wm-bot: petrb: fixing missing packages on exec nodes
 
=== June 9 ===
* 20:44 wm-bot: petrb: moved logs on -login to separate storage
 
=== June 8 ===
* 21:24 wm-bot: petrb: installing python-imaging-tk on grid
* 21:20 wm-bot: petrb: installing python-tk
* 21:16 wm-bot: petrb: installing python-flickrapi on grid
* 21:16 wm-bot: petrb: installing
* 16:49 wm-bot: petrb: turned off wmf style of vi on tools-dev; feel free to slap me :o or do `cat /etc/vim/vimrc.local >> .vimrc` if you love it
* 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
* 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
* 09:38 wm-bot: petrb: update python requests to version 1.2.3.1
 
=== June 7 ===
* 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)
 
=== June 5 ===
* 09:52 wm-bot: petrb: on -dev
* 09:52 wm-bot: petrb: moving /usr to separate volume expect problems :o
* 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
* 09:31 wm-bot: petrb: Houston, we have a problem: / on -dev is 94% full
* 09:28 wm-bot: petrb: installed openjdk7 on -dev
* 09:00 wm-bot: petrb: removing wd-terminator service
* 08:39 wm-bot: petrb: started toolwatcher
* 07:04 wm-bot: petrb: installing maven on -dev
 
=== June 4 ===
* 14:49 wm-bot: petrb: installing sbt in order to fix b48859
* 13:28 wm-bot: petrb: installing csh on cluster
* 08:37 wm-bot: petrb: installing python-memcache on exec nodes
 
=== June 3 ===
* 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
* 14:15 wm-bot: petrb: removing popularity contest
* 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
* 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid troubles with missing libs on some etc
 
=== June 2 ===
* 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc
 
=== June 1 ===
* 20:57 wm-bot: petrb: installed these on exec nodes because they were on some and not on others: cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time
* 20:42 wm-bot: petrb: installing wikitools cluster wide
* 20:40 wm-bot: petrb: installing oursql cluster wide
* 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc
 
=== May 31 ===
* 19:17 petan: deleting xtools project (requested by Cyberpower678)
* 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
* 17:17 wm-bot: petrb: installed lsof to -dev
* 15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
* 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
* 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
* 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
* 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
* 13:04 wm-bot: petrb: installing bashdb on tools and -dev
* 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
* 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060  1964 88 May30 ?  21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached
 
=== May 30 ===
* 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
* 08:26 wm-bot: petrb: updated sql command
 
=== May 29 ===
* 21:05 wm-bot: petrb: running sudo apt-get install php5-gd
 
=== May 28 ===
* 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login
 
=== May 27 ===
* 08:46 wm-bot: petrb: changed config of mysql to use /mnt as the path for binary logs; this however requires the server to be restarted
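A sketch of relocating the binary logs as described (the option file path and log basename are assumptions; the restart requirement is from the entry):
<pre>
# Prepare the new location (names assumed)
sudo mkdir -p /mnt/mysql-binlog
sudo chown mysql:mysql /mnt/mysql-binlog
# In the [mysqld] section of /etc/mysql/my.cnf (path assumed):
#   log_bin = /mnt/mysql-binlog/binlog
sudo service mysql restart
</pre>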
 
=== May 24 ===
* 08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
* 08:28 petan: created 2 more exec nodes, setting up now...
 
=== May 23 ===
* 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20
 
=== May 22 ===
* 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
* 14:28 labs-logs-bottie: petrb: installed netcat as well
* 14:28 labs-logs-bottie: petrb: installed telnet to -dev
* 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there
 
=== May 21 ===
* 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev
 
=== May 19 ===
* 13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and unattached anyway
* 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
* 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot's /bin/nano /tmp/crontab.d4JhUj/crontab eats too much cpu
* 12:50 petan: nvm previous line
* 12:50 labs-logs-bottie: petrb: vul alias viewuserlang
 
=== May 14 ===
* 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on login so that temp files do not fragment the root fs and it does not get filled up by them; it also makes it easier to track filesystem usage
* 13:16 Coren: reboot -dev, need to test kernel upgrade
 
=== May 10 ===
* 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation
 
=== May 9 ===
* 04:12 Coren: added -exec-03 and -exec-04.  Moar power!!1!
 
=== May 6 ===
* 19:59 Coren: made tools-dev.wmflabs.org public
* 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily and so that unused memory blocks can be swapped out, using the remaining memory more effectively
* 08:00 labs-logs-bottie: petrb: making an lvm volume from the unused disk at /mnt on -login so that we can eventually use it somewhere if needed
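A sketch of both steps (device and volume names are illustrative; the swap is deliberately small, as in the 08:04 entry):
<pre>
# Turn the unused disk into an LVM physical volume / volume group (names illustrative)
sudo pvcreate /dev/vdb
sudo vgcreate vd /dev/vdb
# Carve out a small swap logical volume and enable it
sudo lvcreate -L 2G -n swap vd
sudo mkswap /dev/vd/swap
sudo swapon /dev/vd/swap
</pre>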
 
=== May 4 ===
* 17:50 labs-logs-bottie: petrb: removing project foobar as well
* 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
* 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
* 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar
 
=== May 2 ===
* 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users connected to
 
=== May 1 ===
* 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home
 
=== April 27 ===
* 18:54 labs-logs-bottie: petrb: installing pymysql using pip on whole grid because it is needed for greenrosseta (for some reason it is better than python-mysql package)
 
=== April 26 ===
* 23:55 Coren: reboot to finish security updates
* 08:00 labs-logs-bottie: petrb: patching qtop
* 07:57 labs-logs-bottie: petrb: added tools-dev to the admin host list so that qtop works, and fixed the bug in qtop
* 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there
 
=== April 25 ===
* 19:00 Coren: Maintenance over; systems restarted and should be working.
* 18:18 labs-logs-bottie: petrb: we are getting into trouble with memory on tools-db; there is less than 20% free memory
* 18:01 Coren: Begin maintenance (login disabled)
* 13:21 petan: removing local-wikidatastats from ldap
 
=== April 24 ===
* 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
* 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
* 11:32 legoktm: wikidatastats attempting to install limn
* 11:15 labs-logs-bottie: petrb: installing npm to -login instance
* 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P
 
=== April 23 ===
* 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
* 12:14 labs-logs-bottie: petrb: qtop on -dev
* 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way
 
=== April 19 ===
* 22:38 Coren: reboot -login, all done with the NFS config.  yeay.
* 17:13 Coren: (final?) reboot of -login with the new autofs configuration
* 16:24 Coren: (rebooted -login)
* 16:24 Coren: autofs + gluster = fail
* 14:45 Coren: reboot -login (NFS mount woes)
 
=== April 15 ===
* 22:29 Coren: also a test; note how said bot knows its place.  :-)
* 22:14 andrewbogott: this is a test of labs-morebots.
* 21:49 andrewbogott: this is a test
* 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
* 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box
 
=== April 11 ===
* 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
* 07:42 labs-logs-bottie: petrb: removed reboot information from motd
<noinclude>
=== April 10 ===
* 21:42 labs-logs-bottie: petrb: reverting the change
* 21:35 labs-logs-bottie: petrb: inserting /lib to /etc/ld.so.conf in order to fix the bug with gcc / ubuntu see irc logs (22:30 GMT)
* 21:22 labs-logs-bottie: petrb: installing jobutils.deb on login
* 20:30 labs-logs-bottie: petrb: installing some dev tools to -dev
* 20:23 petan: created -dev instance for various purposes
 
=== April 8 ===
* 14:07 labs-logs-bottie: petrb: ongrid apt-get install mono-complete
* 13:50 labs-logs-bottie: local-afcbot: unable to run mono applications: The assembly mscorlib.dll was not found or could not be loaded.
 
=== April 4 ===
* 14:40 labs-logs-bottie: petrb: trying to convert afcbot to new service group local-afcbot
 
=== April 2 ===
* 16:04 labs-logs-bottie: petrb: installed log to /home/petrb/bin/ and testing it
* 15:55 petan: patched /usr/local/bin/qdisplay so that it can display jobs per node properly
* 15:54 petan: giving sudo to Petrb in order to update qdisplay
 
=== March 28 ===
* 15:44 Coren: reboot (still unactivated) tools-shadow
 
=== March 26 ===
* 18:17 Coren: Doubled the size of the compute grid!  (added tools-exec-02 to the grid)
 
=== March 21 ===
* 23:30 Coren: turned on interpretation of .py as CGI by default on tools-webserver-* to parallel .php
* 16:15 Coren: Added tools-login.wmflabs.org public IP for the tools-login instance and allowed incoming ssh to it.
 
=== March 19 ===
* 14:21 Coren: reboot cycle (all instances) to apply security updates
 
=== March 13 ===
* 14:04 Coren: restarted webserver: relax AllowOverride options
 
=== March 11 ===
* 15:47 Coren: enabled X forwarding for qmon.  Also, installed qmon.
* 13:17 Coren: added python-requests (1.0, from pip)
 
=== March 7 ===
* 20:41 Coren: tools' php errors now sent to ~/php_errors.log
* 19:31 Coren: access.log now split by tools (in tool homedir)
* 16:15 Coren: can haz database (support for user/tool databases in place)
 
=== March 6 ===
* 20:25 Coren: tools-db installed mariadb-server from official repo
* 19:50 Coren: created tools-db instance for a (temporary) mysql install
 
=== March 5 ===
* 21:45 Coren: rejiggered the webproxy config to be smarter about paths not leading to specific tools
 
=== February 26 ===
* 23:49 Coren: Original note structure: created tools-{master,exec-01,webserver-01,webproxy} instances
* 18:39 Coren: Created tools-puppet-test for dev and testing of tools' puppet classes.
* 01:52 Coren: created instance tools-login (primary login/dev instance)
* 01:52 Coren: created sudo policies and security groups (skeletal)
* 01:08 Coren: Creation of the new project for preproduction deployment of the current (preliminary) plan [[mw:Wikimedia Labs/Tool Labs/Design]]
</noinclude>
{{SAL|Project Name=tools}}
<noinclude>[[Category:SAL]]</noinclude>

=== 2022-08-20 ===
* 07:44 dcaro_away: all k8s nodes ready now \o/ (T315718)
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck waiting for the tools home (NFS?); came back up after the reboot (T315718)
* 07:41 dcaro_away: cloudvirt1023 going down took out 3 k8s workers, 1 control node, a grid exec node and a weblight node; they are taking long to restart, looking into it (T315718)

=== 2022-08-18 ===
* 14:45 andrewbogott: adding lucaswerkmeister as projectadmin (T314527)
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair

=== 2022-08-17 ===
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # T315459
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected

=== 2022-08-16 ===
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06 -> docker-registry-05

=== 2022-08-11 ===
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues

=== 2022-08-05 ===
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2022-07-20 ===
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-07-19 ===
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 15:59 taavi: tag current maintain-kubernetes :beta image as :latest
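A sketch of the retag with standard docker commands (the registry host is an assumption):
<pre>
# Registry host is an assumption
docker pull docker-registry.tools.wmflabs.org/maintain-kubernetes:beta
docker tag docker-registry.tools.wmflabs.org/maintain-kubernetes:beta docker-registry.tools.wmflabs.org/maintain-kubernetes:latest
docker push docker-registry.tools.wmflabs.org/maintain-kubernetes:latest
</pre>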

=== 2022-07-17 ===
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2022-07-14 ===
* 13:48 taavi: rebooting tools-sgeexec-10-2

=== 2022-07-13 ===
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2022-07-11 ===
* 16:06 wm-bot2: Increased quotas by {self.increases} (T312692) - cookbook ran by nskaggs@x1carbon

=== 2022-07-07 ===
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus

=== 2022-06-28 ===
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master (T311538) - cookbook ran by dcaro@vulcanus
* 15:51 taavi: add 4096G cinder quota T311509

=== 2022-06-27 ===
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 18:02 taavi: switchover active cron server to tools-sgecron-2 T284767
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: T311412 updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 T311412
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it T311412

=== 2022-06-23 ===
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid T277653

=== 2022-06-22 ===
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-06-21 ===
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko

=== 2022-06-03 ===
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor T309821
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online T309821
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap; convert to g3.cores4.ram8.disk20.swap8.ephem20 flavor T309821
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts T309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 (T309821)
* 13:20 bd808: publish tools-webservice 0.86 (T309821)
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 T309821
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid T309821
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package (T309821)
* 03:10 bd808: publish tools-webservice 0.85 with hack for T309821
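The 10:36 drain and the 03:21 error-state cleanup both map to standard gridengine admin commands; a sketch (host and job id illustrative):
<pre>
# Stop new jobs being scheduled on one node's queues
sudo qmod -d '*@tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud'
# Force-delete a job stuck in a 'deleting' state (job id illustrative)
sudo qdel -f 12345
# Clear queue error states once the node is healthy again
sudo qmod -c '*@tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud'
</pre>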

=== 2022-06-02 ===
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs (T308402)
* 11:47 dcaro: refresh registry-admission-controller certs (T308402)
* 11:42 dcaro: refresh ingress-admission-controller certs (T308402)
* 11:36 dcaro: refresh volume-admission-controller certs (T308402)
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster T277653
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster T277653
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko

=== 2022-06-01 ===
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]

=== 2022-05-31 ===
* 16:51 taavi: delete tools-sgeexec-0904 for T309525 experimentation

=== 2022-05-30 ===
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) T277653

=== 2022-05-22 ===
* 17:04 taavi: failover tools-redis to the updated cluster T278541
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud (T308982) - cookbook ran by taavi@runko

=== 2022-05-14 ===
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940

=== 2022-05-10 ===
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades T290494
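A sketch of what enabling unattended upgrades for the Wikimedia apt origin might look like (the file name and origin pattern are assumptions, not the actual puppetized config):
<pre>
# File name and Origins-Pattern are assumptions
cat <<'EOF' | sudo tee /etc/apt/apt.conf.d/52unattended-upgrades-wikimedia
Unattended-Upgrade::Origins-Pattern {
        "origin=Wikimedia";
};
EOF
</pre>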

=== 2022-05-06 ===
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl (T307812)

=== 2022-05-05 ===
* 17:28 taavi: deploy tools-webservice 0.83 T307693

=== 2022-05-03 ===
* 08:20 taavi: redis: start replication from the old cluster to the new one (T278541)
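A sketch of starting that replication on the new cluster's primary (host name illustrative; <code>replicaof</code> is the standard redis command, <code>slaveof</code> on older releases):
<pre>
# Run on the new cluster's primary; source host/port illustrative
redis-cli replicaof tools-redis-old.example 6379
# Watch the sync progress
redis-cli info replication
</pre>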

=== 2022-05-02 ===
* 08:54 taavi: restart acme-chief.service T307333

=== 2022-04-25 ===
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 (T214343)
* 14:46 bd808: Building toolforge-webservice v0.82

=== 2022-04-23 ===
* 16:51 bd808: Built new perl532-sssd/{base,web} images and pushed to registry (T214343)

=== 2022-04-12 ===
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' (T305986)
* 21:27 bd808: Added komla to 'roots' sudoers policy (T305986)
* 21:24 bd808: Add komla as projectadmin (T305986)

=== 2022-04-10 ===
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since September, taking up 1.3G of disk space)

=== 2022-04-09 ===
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /

=== 2022-04-08 ===
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component

=== 2022-03-28 ===
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud (T304816) - cookbook ran by arturo@nostromo

=== 2022-03-14 ===
* 11:44 arturo: deploy jobs-framework-emailer 9470a5f (T286135)
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bash 5.1.4 to the local repo (T297090)

=== 2022-03-10 ===
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902

=== 2022-03-01 ===
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state (T302702)
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 (T302702)
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand

=== 2022-02-28 ===
* 08:02 taavi: reboot sgeexec-0916
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /

=== 2022-02-17 ===
* 08:23 taavi: deleted tools-clushmaster-02
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access

=== 2022-02-16 ===
* 00:12 bd808: Image builds completed.

=== 2022-02-15 ===
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug T301736
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for T301736
* 11:15 arturo: reboot tools-sgebastion-10 for T301736

=== 2022-02-10 ===
* 15:07 taavi: shutdown tools-clushmaster-02 T298191
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally T214427
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - T214427
* 08:06 taavi: disable puppet globally for enabling puppetdb T214427
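The three 08:xx entries above follow the standard pattern for a risky puppetmaster change: disable agents fleet-wide, apply the change, then re-enable. A sketch of the disable/enable step (the host list and plain ssh loop are illustrative):
<pre>
# Host list illustrative
while read -r host; do
    ssh "$host" "sudo puppet agent --disable 'enabling puppetdb T214427'"
done < tools-hosts.txt
# ... apply the puppetdb change on the puppetmaster, then re-enable:
while read -r host; do
    ssh "$host" "sudo puppet agent --enable"
done < tools-hosts.txt
</pre>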

=== 2022-02-09 ===
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet T214427
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] (T277653) - cookbook ran by arturo@nostromo
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 T298191

=== 2022-02-07 ===
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository (T288406)
* 12:52 taavi: updated maintain-kubeusers for T301081

=== 2022-02-04 ===
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with T301015
* 21:36 taavi: clear error state from some webgrid nodes

=== 2022-02-03 ===
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate

=== 2022-01-30 ===
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover T278541
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for T278541
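A sketch of the 14:41 port creation with the OpenStack CLI (the network and subnet names are assumptions; the address is from the entry):
<pre>
# Network/subnet names are assumptions
openstack port create --network lan-flat-cloudinstances2b \
    --fixed-ip subnet=cloud-instances2-b-eqiad,ip-address=172.16.2.46 \
    tools-redis-failover-vip
</pre>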

=== 2022-01-26 ===
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttd and 2 generic nodes (T277653)

=== 2022-01-25 ===
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 11:44 arturo: rebooting buster exec nodes
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4

=== 2022-01-24 ===
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes (T277653)

=== 2022-01-20 ===
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 12:56 arturo: scaling up the grid with 10 buster exec nodes (T277653)

=== 2022-01-19 ===
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move

=== 2022-01-14 ===
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, T299243

=== 2022-01-12 ===
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'

=== 2022-01-04 ===
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on T298501
