Nova Resource:Tools/SAL

=== 2019-01-07 ===
=== 2022-09-28 ===
* 17:21 bstorm_: [[phab:T67777|T67777]] - set the max_u_jobs global grid config setting to 50 in the new grid
* 21:23 lucaswerkmeister: on tools-sgebastion-10: run-puppet-agent # [[phab:T318858|T318858]]
* 15:54 bstorm_: [[phab:T67777|T67777]] Set stretch grid user job limit to 16
* 21:22 lucaswerkmeister: on tools-sgebastion-10: apt remove emacs-common emacs-bin-common # fix package conflict, [[phab:T318858|T318858]]
* 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.
* 21:15 lucaswerkmeister: added root SSH key for myself, manually ran puppet on tools-sgebastion-10 to apply it (seemingly successfully)


=== 2019-01-06 ===
=== 2022-09-22 ===
* 22:06 bd808: Added floating ip to tools-sgebastion-06 ([[phab:T212360|T212360]])
* 12:30 taavi: add TheresNoTime to the 'toollabs-trusted' gerrit group [[phab:T317438|T317438]]
* 12:27 taavi: add TheresNoTime as a project admin and to the roots sudo policy [[phab:T317438|T317438]]


=== 2019-01-05 ===
=== 2022-09-10 ===
* 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.
* 07:39 wm-bot2: removing instance tools-prometheus-03 - cookbook ran by taavi@runko


=== 2019-01-04 ===
=== 2022-09-07 ===
* 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history
* 10:22 dcaro: Pushing the new toolforge builder image based on the new 0.8 buildpacks ([[phab:T316854|T316854]])


=== 2019-01-03 ===
=== 2022-09-06 ===
* 21:03 bd808: Enabled Puppet on tools-proxy-02
* 08:06 dcaro_away: Published new toolforge-bullseye0-run and toolforge-bullseye0-build images for the toolforge buildpack builder ([[phab:T316854|T316854]])
* 20:53 bd808: Disabled Puppet on tools-proxy-02
* 20:51 bd808: Enabled Puppet on tools-proxy-01
* 20:49 bd808: Disabled Puppet on tools-proxy-01


=== 2018-12-21 ===
=== 2022-08-25 ===
* 16:29 andrewbogott: migrating tools-exec-1416  to labvirt1004
* 10:40 taavi: tagged new version of the python39-web container with a shell implementation of webservice-runner [[phab:T293552|T293552]]
* 16:01 andrewbogott: moving tools-grid-master to labvirt1004
* 00:35 bd808: Installed tools-manifest 0.14 for [[phab:T212390|T212390]]
* 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for [[phab:T212390|T212390]]
* 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for [[phab:T212390|T212390]]
* 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for [[phab:T212390|T212390]]


=== 2018-12-20 ===
=== 2022-08-24 ===
* 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
* 12:20 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|eba66bc}}) - cookbook ran by taavi@runko
* 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
* 12:20 taavi: upgrading ingress-nginx to v1.3
* 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002


=== 2018-12-17 ===
=== 2022-08-20 ===
* 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - [[phab:T212153|T212153]]
* 07:44 dcaro_away: all k8s nodes ready now \o/ ([[phab:T315718|T315718]])
* 19:18 gtirloni: decreased nfs-mount-manager verbosity ([[phab:T211817|T211817]])
* 07:43 dcaro_away: rebooted tools-k8s-control-2, seemed stuck trying to wait for tools home (nfs?), after reboot came back up ([[phab:T315718|T315718]])
* 19:02 arturo: [[phab:T211977|T211977]] add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
* 07:41 dcaro_away: cloudvirt1023 down took out 3 workers, 1 control, and a grid exec and a weblight, they are taking long to restart, looking ([[phab:T315718|T315718]])
* 13:46 arturo: [[phab:T211977|T211977]] `aborrero@tools-services-01:~$  sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`


=== 2018-12-11 ===
=== 2022-08-18 ===
* 13:19 gtirloni: Removed BigBrother ([[phab:T208357|T208357]])
* 14:45 andrewbogott: adding lucaswerkmeister  as projectadmin ([[phab:T314527|T314527]])
* 14:43 andrewbogott: removing some inactive projectadmins: rush, petrb, mdipietro, jeh, krenair


=== 2018-12-05 ===
=== 2022-08-17 ===
* 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster ([[phab:T196973|T196973]])
* 16:34 taavi: kubectl sudo delete cm -n tool-wdml maintain-kubeusers # [[phab:T315459|T315459]]
* 08:30 taavi: failing the grid from the shadow back to the master, some disruption expected


=== 2018-12-04 ===
=== 2022-08-16 ===
* 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage [[phab:T164123|T164123]]
* 17:28 taavi: fail over docker-registry, tools-docker-registry-06->docker-registry-05
* 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 ([[phab:T164123|T164123]])


=== 2018-12-01 ===
=== 2022-08-11 ===
* 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 ([[phab:T194615|T194615]])
* 16:57 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by taavi@runko
* 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts
* 16:55 taavi: restart puppetdb on tools-puppetdb-1, crashed during the ceph issues


=== 2018-11-30 ===
=== 2022-08-05 ===
* 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
* 15:08 wm-bot2: removing grid node tools-sgewebgen-10-1.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 22:18 gtirloni: Pushed new jdk8 docker image based on stretch ([[phab:T205774|T205774]])
* 15:05 wm-bot2: removing grid node tools-sgeexec-10-12.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance ([[phab:T194615|T194615]])
* 15:00 wm-bot2: created node tools-sgewebgen-10-3.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2018-11-27 ===
=== 2022-08-03 ===
* 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb
* 15:51 dhinus: recreated jobs-api pods to pick up new ConfigMap (see the sketch below)
* 15:02 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|c47ac41}}) - cookbook ran by fran@MacBook-Pro.station
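
A running pod will not usually pick up ConfigMap changes on its own, since most services read their configuration once at startup; recreating the pods (or restarting the deployment) is the standard way to make them reload it. A minimal sketch, assuming the component runs as a deployment named jobs-api in a jobs-api namespace (both names are assumptions, not taken from the log):
<syntaxhighlight lang="bash">
# Restart the deployment so freshly created pods mount the updated ConfigMap.
# Namespace and deployment name are assumptions for illustration only.
kubectl -n jobs-api rollout restart deployment/jobs-api

# Equivalent effect: delete the pods and let the ReplicaSet recreate them.
kubectl -n jobs-api delete pod -l app=jobs-api
</syntaxhighlight>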


=== 2018-11-26 ===
=== 2022-07-20 ===
* 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) ([[phab:T210190|T210190]])
* 19:31 taavi: reboot toolserver-proxy-01 to free up disk space probably held by stale file handles
* 17:34 gtirloni: [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (again)
* 08:06 wm-bot2: removing grid node tools-sgeexec-10-6.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:31 gtirloni: deleted instance tools-clushmaster-01 ([[phab:T209701|T209701]])


=== 2018-11-20 ===
=== 2022-07-19 ===
* 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
* 17:53 wm-bot2: created node tools-sgeexec-10-21.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
* 17:00 wm-bot2: removing grid node tools-sgeexec-10-3.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
* 16:58 wm-bot2: removing grid node tools-sgeexec-10-4.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 10:52 arturo: [[phab:T208579|T208579]] distributing now misctools and jobutils 1.33 in all aptly repos
* 16:24 wm-bot2: created node tools-sgeexec-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 09:43 godog: restart prometheus@tools on prometheus-01
* 15:59 taavi: tag current maintain-kubernetes :beta image as: :latest


=== 2018-11-16 ===
=== 2022-07-17 ===
* 21:16 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
* 15:52 wm-bot2: removing grid node tools-sgeexec-10-10.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:47 gtirloni: deleted tools-mail instance
* 15:43 wm-bot2: removing grid node tools-sgeexec-10-2.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
* 13:26 wm-bot2: created node tools-sgeexec-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
* 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades


=== 2018-11-14 ===
=== 2022-07-14 ===
* 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
* 13:48 taavi: rebooting tools-sgeexec-10-2
* 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
* 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009


=== 2018-11-13 ===
=== 2022-07-13 ===
* 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo ([[phab:T207970|T207970]])
* 12:09 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus
* 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
* 13:29 gtirloni: Changed active mail relay to tools-mail-02 ([[phab:T209356|T209356]])
* 13:22 arturo: [[phab:T207970|T207970]] misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
* 13:05 arturo: [[phab:T207970|T207970]] there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
* 12:59 arturo: the puppet issue has been solved by reverting the code
* 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit


=== 2018-11-08 ===
=== 2022-07-11 ===
* 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
* 16:06 wm-bot2: Increased quotas by <nowiki>{</nowiki>self.increases<nowiki>}</nowiki> ([[phab:T312692|T312692]]) - cookbook ran by nskaggs@x1carbon
* 17:58 arturo: installing jobutils and misctools v1.32 ([[phab:T207970|T207970]])
* 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
* 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
* 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
* 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
* 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
* 11:32 gtirloni: removed temporary /var/mail fix ([[phab:T208843|T208843]])


=== 2018-11-07 ===
=== 2022-07-07 ===
* 10:37 gtirloni: removed invalid apt.conf.d file from all hosts ([[phab:T110055|T110055]])
* 07:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master - cookbook ran by dcaro@vulcanus


=== 2018-11-02 ===
=== 2022-06-28 ===
* 18:11 arturo: [[phab:T206223|T206223]] some disturbances due to the certificate renewal
* 17:34 wm-bot2: cleaned up grid queue errors on tools-sgegrid-master ([[phab:T311538|T311538]]) - cookbook ran by dcaro@vulcanus
* 17:04 arturo: renewing *.wmflabs.org [[phab:T206223|T206223]]
* 15:51 taavi: add 4096G cinder quota [[phab:T311509|T311509]]


=== 2018-10-31 ===
=== 2022-06-27 ===
* 18:02 gtirloni: truncated big .err and error.log files
* 18:14 taavi: restart calico, appears to have got stuck after the ca replacement operation
* 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde
* 18:02 taavi: switchover active cron server to tools-sgecron-2 [[phab:T284767|T284767]]
* 17:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0915.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:52 wm-bot2: removing grid node tools-sgewebgrid-generic-0902.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgeexec-0942.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:15 taavi: [[phab:T311412|T311412]] updating ca used by k8s-apiserver->etcd communication, breakage may happen
* 14:58 taavi: renew puppet ca cert and certificate for tools-puppetmaster-02 [[phab:T311412|T311412]]
* 14:50 taavi: backup /var/lib/puppet/server to /root/puppet-ca-backup-2022-06-27.tar.gz on tools-puppetmaster-02 before we do anything else to it [[phab:T311412|T311412]]
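
The backup in the 14:50 entry above amounts to archiving the puppetmaster's CA and certificate state before the renewal, so it can be rolled back if anything goes wrong. A minimal sketch of that step (the exact tar invocation is assumed, not taken from the log):
<syntaxhighlight lang="bash">
# Snapshot the puppetmaster CA/certificate directory before renewing anything,
# so it can be restored if the renewal goes wrong.
sudo tar -czf /root/puppet-ca-backup-2022-06-27.tar.gz -C / var/lib/puppet/server
</syntaxhighlight>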


=== 2018-10-29 ===
=== 2022-06-23 ===
* 17:00 bd808: Ran grid engine orphan process kill script from [[phab:T153281|T153281]]
* 17:51 wm-bot2: removing grid node tools-sgeexec-0941.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:49 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0916.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:46 wm-bot2: removing grid node tools-sgewebgrid-generic-0901.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:32 wm-bot2: removing grid node tools-sgeexec-0939.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:30 wm-bot2: removing grid node tools-sgeexec-0938.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:27 wm-bot2: removing grid node tools-sgeexec-0937.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:22 wm-bot2: removing grid node tools-sgeexec-0936.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:19 wm-bot2: removing grid node tools-sgeexec-0935.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:17 wm-bot2: removing grid node tools-sgeexec-0934.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:14 wm-bot2: removing grid node tools-sgeexec-0933.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:11 wm-bot2: removing grid node tools-sgeexec-0932.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 17:09 wm-bot2: removing grid node tools-sgeexec-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:30 wm-bot2: removing grid node tools-sgeexec-0947.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 13:59 taavi: removing remaining continuous jobs from the stretch grid [[phab:T277653|T277653]]


=== 2018-10-26 ===
=== 2022-06-22 ===
* 10:34 arturo: [[phab:T207970|T207970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 15:54 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0917.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 10:32 arturo: [[phab:T209970|T209970]] added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
* 15:51 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0918.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:47 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0919.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:45 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0920.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2018-10-19 ===
=== 2022-06-21 ===
* 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
* 15:23 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017
* 15:20 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0914.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:18 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0913.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko
* 15:07 wm-bot2: removing grid node tools-sgewebgrid-lighttpd-0912.tools.eqiad1.wikimedia.cloud - cookbook ran by taavi@runko


=== 2018-10-18 ===
=== 2022-06-03 ===
* 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017
* 20:07 wm-bot2: created node tools-sgeweblight-10-26.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:51 balloons: Scaling webservice nodes to 20, using new 8G swap flavor [[phab:T309821|T309821]]
* 19:35 wm-bot2: created node tools-sgeweblight-10-25.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:03 wm-bot2: created node tools-sgeweblight-10-20.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:01 wm-bot2: created node tools-sgeweblight-10-19.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 19:00 balloons: depooled old nodes, bringing entirely new grid of nodes online [[phab:T309821|T309821]]
* 18:22 wm-bot2: created node tools-sgeweblight-10-17.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:54 wm-bot2: created node tools-sgeweblight-10-16.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 17:52 wm-bot2: created node tools-sgeweblight-10-15.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 16:59 andrewbogott: building a bunch of new lighttpd nodes (beginning with tools-sgeweblight-10-12) using a flavor with more swap space
* 16:56 wm-bot2: created node tools-sgeweblight-10-12.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by andrew@buster
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor [[phab:T309821|T309821]]
* 15:50 balloons: temp add 1.0G swap to sgeweblight hosts [[phab:T309821|T309821]]
* 15:50 balloons: fix g3.cores4.ram8.disk20.swap24.ephem20 flavor to include swap. Convert to fix g3.cores4.ram8.disk20.swap8.ephem20 flavor t309821
* 15:49 balloons: temp add 1.0G swap to sgeweblight hosts t309821
* 13:25 bd808: Upgrading fleet to tools-webservice 0.86 ([[phab:T309821|T309821]])
* 13:20 bd808: publish tools-webservice 0.86 ([[phab:T309821|T309821]])
* 12:46 taavi: start webservicemonitor on tools-sgecron-01 [[phab:T309821|T309821]]
* 10:36 taavi: draining each sgeweblight node one by one, and removing the jobs stuck in 'deleting' too (see the drain sketch below)
* 05:05 taavi: removing duplicate (there should be only one per tool) web service jobs from the grid [[phab:T309821|T309821]]
* 04:52 taavi: revert bd808's changes to profile::toolforge::active_proxy_host
* 03:21 bd808: Cleared queue error states after deploying new toolforge-webservice package ([[phab:T309821|T309821]])
* 03:10 bd808: publish tools-webservice 0.85 with hack for [[phab:T309821|T309821]]
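
Draining a web grid node, as in the 10:36 entry above, boils down to disabling the node's queue instances so nothing new is scheduled there, then clearing out whatever is still running so it can come back elsewhere. A rough sketch with plain gridengine commands (the host name is illustrative; the cookbooks wrap the equivalent steps):
<syntaxhighlight lang="bash">
NODE=tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud   # example node

# Disable every queue instance on the node so no new jobs land on it.
qmod -d "*@${NODE}"

# List the jobs still running there, then delete them so they can be
# restarted on other nodes.
qhost -j -h "${NODE}"
qdel <job-id>
</syntaxhighlight>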


=== 2018-10-16 ===
=== 2022-06-02 ===
* 15:13 bd808: (repost for gtirloni) [[phab:T186571|T186571]] removed legofan4000 user from project-tools group (leftover from [[phab:T165624|T165624]] legofan4000->macfan4000 rename)
* 22:26 bd808: Rebooting tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud. Node is full of jobs that are not tracked by grid master and failing to spawn new jobs sent by the scheduler
* 21:56 bd808: Removed legacy "active_proxy_host" hiera setting
* 21:55 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for profile::toolforge::active_proxy_host key
* 21:41 bd808: Updated hiera to use fqdn of 'tools-proxy-06.tools.eqiad1.wikimedia.cloud' for active_redis key
* 21:23 wm-bot2: created node tools-sgeweblight-10-8.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:42 wm-bot2: rebooting stretch exec grid workers - cookbook ran by taavi@runko
* 12:13 wm-bot2: created node tools-sgeweblight-10-7.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 12:03 dcaro: refresh prometheus certs ([[phab:T308402|T308402]])
* 11:47 dcaro: refresh registry-admission-controller certs ([[phab:T308402|T308402]])
* 11:42 dcaro: refresh ingress-admission-controller certs ([[phab:T308402|T308402]])
* 11:36 dcaro: refresh volume-admission-controller certs ([[phab:T308402|T308402]])
* 11:24 wm-bot2: created node tools-sgeweblight-10-6.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko
* 11:17 taavi: publish jobutils 1.44 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 10:16 taavi: publish tools-webservice 0.84 that updates the grid default from stretch to buster [[phab:T277653|T277653]]
* 09:54 wm-bot2: created node tools-sgeexec-10-14.tools.eqiad1.wikimedia.cloud and added it to the grid - cookbook ran by taavi@runko


=== 2018-10-07 ===
=== 2022-06-01 ===
* 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 [[phab:T194859|T194859]]
* 11:18 taavi: depool and remove tools-sgeexec-09[07-14]
* 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
* 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens


=== 2018-09-21 ===
=== 2022-05-31 ===
* 12:35 arturo: cleanup stale apt preference files (pinning) in tools-clushmaster-01
* 16:51 taavi: delete tools-sgeexec-0904 for [[phab:T309525|T309525]] experimentation
* 12:14 arturo: [[phab:T205078|T205078]] same for {jessie,stretch}-wikimedia
* 12:12 arturo: [[phab:T205078|T205078]] upgrade trusty-wikimedia packages (git-fat, debmonitor)
* 11:57 arturo: [[phab:T205078|T205078]] purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines


=== 2018-09-17 ===
=== 2022-05-30 ===
* 09:13 arturo: [[phab:T204481|T204481]] aborrero@tools-mail:~$ sudo exiqgrep -i {{!}} xargs sudo exim -Mrm
* 08:24 taavi: depool tools-sgeexec-[0901-0909] (7 nodes total) [[phab:T277653|T277653]]


=== 2018-09-14 ===
=== 2022-05-26 ===
* 11:22 arturo: [[phab:T204267|T204267]] stop the corhist tool (k8s) because it is hammering the wikidata API
* 15:39 wm-bot2: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T309146|T309146]]) - cookbook ran by taavi@runko
* 10:51 arturo: [[phab:T204267|T204267]] stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API


=== 2018-09-08 ===
=== 2022-05-22 ===
* 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog ([[phab:T196137|T196137]])
* 17:04 taavi: failover tools-redis to the updated cluster [[phab:T278541|T278541]]
* 16:42 wm-bot2: removing grid node tools-sgeexec-0940.tools.eqiad1.wikimedia.cloud ([[phab:T308982|T308982]]) - cookbook ran by taavi@runko


=== 2018-09-07 ===
=== 2022-05-16 ===
* 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb
* 14:02 wm-bot2: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/ingress-nginx ({{Gerrit|7037eca}}) - cookbook ran by taavi@runko


=== 2018-08-27 ===
=== 2022-05-14 ===
* 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]
* 10:47 taavi: hard reboot unresponsive tools-sgeexec-0940
* 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
* 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` [[phab:T202932|T202932]]


=== 2018-08-22 ===
=== 2022-05-12 ===
* 13:02 arturo: I used this command: `sudo exim -bp {{!}} sudo exiqgrep -i {{!}} xargs sudo exim -Mrm`
* 12:36 taavi: re-enable CronJobControllerV2 [[phab:T308205|T308205]]
* 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com
* 09:28 taavi: deploy jobs-api update [[phab:T308204|T308204]]
* 09:15 wm-bot2: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|e6fa299}}) ([[phab:T308204|T308204]]) - cookbook ran by taavi@runko


=== 2018-08-19 ===
=== 2022-05-10 ===
* 09:12 legoktm: rebuilding python/base k8s images for https://gerrit.wikimedia.org/r/453665 ([[phab:T202218|T202218]])
* 15:18 taavi: depool tools-k8s-worker-42 for experiments
* 13:54 taavi: enable distro-wikimedia unattended upgrades [[phab:T290494|T290494]]


=== 2018-08-14 ===
=== 2022-05-06 ===
* 21:02 legoktm: rebuilt php7.2 docker images for https://gerrit.wikimedia.org/r/452755
* 19:46 bd808: Rebuilt toolforge-perl532-sssd-base & toolforge-perl532-sssd-web to add liblocale-codes-perl ([[phab:T307812|T307812]])
* 01:08 legoktm: switched tools.coverme and tools.wikiinfo to use PHP 7.2


=== 2018-08-13 ===
=== 2022-05-05 ===
* 23:31 legoktm: rebuilding docker images for webservice upgrade
* 17:28 taavi: deploy tools-webservice 0.83 [[phab:T307693|T307693]]
* 23:16 legoktm: published toollabs-webservice_0.41_all.deb
* 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice


=== 2018-08-09 ===
=== 2022-05-03 ===
* 10:40 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-backports (excluding python-designateclient)
* 08:20 taavi: redis: start replication from the old cluster to the new one ([[phab:T278541|T278541]])
* 10:30 arturo: [[phab:T201602|T201602]] upgrade packages from jessie-wikimedia
* 10:27 arturo: [[phab:T201602|T201602]] upgrade packages from trusty-updates


=== 2018-08-08 ===
=== 2022-05-02 ===
* 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images [[phab:T156626|T156626]] [[phab:T148872|T148872]] [[phab:T158244|T158244]]
* 08:54 taavi: restart acme-chief.service [[phab:T307333|T307333]]


=== 2018-08-06 ===
=== 2022-04-25 ===
* 12:33 arturo: [[phab:T197176|T197176]] installing texlive-full in toolforge
* 14:56 bd808: Rebuilding all docker images to pick up toolforge-webservice v0.82 ([[phab:T214343|T214343]])
* 14:46 bd808: Building toolforge-webservice v0.82


=== 2018-08-01 ===
=== 2022-04-23 ===
* 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break
* 16:51 bd808: Built new perl532-sssd/<nowiki>{</nowiki>base,web<nowiki>}</nowiki> images and pushed to registry ([[phab:T214343|T214343]])


=== 2018-07-30 ===
=== 2022-04-20 ===
* 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
* 16:58 taavi: reboot toolserver-proxy-01 to free up disk space from stale file handles(?)
* 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools
* 07:51 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|8f37a04}}) - cookbook ran by taavi@runko


=== 2018-07-27 ===
=== 2022-04-16 ===
* 04:52 zhuyifei1999_: rebuilding python/base docker container [[phab:T190274|T190274]]
* 18:53 wm-bot: deployed kubernetes component https://gitlab.wikimedia.org/repos/cloud/toolforge/kubernetes-metrics ({{Gerrit|2c485e9}}) - cookbook ran by taavi@runko


=== 2018-07-25 ===
=== 2022-04-12 ===
* 19:02 chasemp: tools-worker-1004 reboot
* 21:32 bd808: Added komla to Gerrit group 'toollabs-trusted' ([[phab:T305986|T305986]])
* 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)
* 21:27 bd808: Added komla to 'roots' sudoers policy ([[phab:T305986|T305986]])
* 21:24 bd808: Add komla as projectadmin ([[phab:T305986|T305986]])


=== 2018-07-18 ===
=== 2022-04-10 ===
* 13:24 arturo: upgrading packages from `stretch-wikimedia` [[phab:T199905|T199905]]
* 18:43 taavi: deleted `/tmp/dwl02.out-20210915` on tools-sgebastion-07 (not touched since september, taking up 1.3G of disk space)
* 13:18 arturo: upgrading packages from `stable` [[phab:T199905|T199905]]
* 12:51 arturo: upgrading packages from `oldstable` [[phab:T199905|T199905]]
* 12:31 arturo: upgrading packages from `trusty-updates` [[phab:T199905|T199905]]
* 12:16 arturo: upgrading packages from `jessie-wikimedia` [[phab:T199905|T199905]]
* 12:09 arturo: upgrading packages from `trusty-wikimedia` [[phab:T199905|T199905]]


=== 2018-06-30 ===
=== 2022-04-09 ===
* 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
* 15:30 taavi: manually prune user.log on tools-prometheus-03 to free up some space on /
* 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
* 16:39 zhuyifei1999_: reboot tools-paws-master-01
* 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
* 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere


=== 2018-06-29 ===
=== 2022-04-08 ===
* 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
* 10:44 arturo: disabled debug mode on the k8s jobs-emailer component
* 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU ([[phab:T123121|T123121]])
* 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. [[phab:T182070|T182070]]


=== 2018-06-28 ===
=== 2022-04-05 ===
* 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
* 07:52 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
* 07:44 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|d7d3463}}) - cookbook ran by arturo@nostromo
* 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
* 07:21 arturo: deploying toolforge-jobs-framework-cli v7
* 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
* 16:48 arturo: rebooting tools-docker-registry-01
* 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
* 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck


=== 2018-06-21 ===
=== 2022-04-04 ===
* 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash
* 17:05 wm-bot: deployed kubernetes component https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 16:56 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-api:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-api ({{Gerrit|cbcfc47}}) - cookbook ran by arturo@nostromo
* 09:28 arturo: deployed toolforge-jobs-framework-cli v6 into aptly and installed it on buster bastions


=== 2018-06-20 ===
=== 2022-03-28 ===
* 15:09 bd808: Killed orphan processes on webgrid nodes ([[phab:T182070|T182070]]); most owned by jembot and croptool
* 09:32 wm-bot: cleaned up grid queue errors on tools-sgegrid-master.tools.eqiad1.wikimedia.cloud ([[phab:T304816|T304816]]) - cookbook ran by arturo@nostromo


=== 2018-06-14 ===
=== 2022-03-15 ===
* 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash
* 16:57 wm-bot: build & push docker image docker-registry.tools.wmflabs.org/toolforge-jobs-framework-emailer:latest from https://gerrit.wikimedia.org/r/cloud/toolforge/jobs-framework-emailer ({{Gerrit|084ee51}}) - cookbook ran by arturo@nostromo
* 11:24 arturo: cleared error state on queue continuous@tools-sgeexec-0939.tools.eqiad.wmflabs (a job took a very long time to be scheduled...)


=== 2018-06-11 ===
=== 2022-03-14 ===
* 10:11 arturo: [[phab:T196137|T196137]] `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null {{!}} grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart {{!}}{{!}} true'`
* 11:44 arturo: deploy jobs-framework-emailer {{Gerrit|9470a5f339fd5a44c97c69ce97239aef30f5ee41}} ([[phab:T286135|T286135]])
* 10:48 dcaro: pushed v0.33.2 tekton control and webhook images, and bashA5.1.4 to the local repo ([[phab:T297090|T297090]])


=== 2018-06-08 ===
=== 2022-03-10 ===
* 07:46 arturo: [[phab:T196137|T196137]] more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes
* 09:42 arturo: cleaned grid queue error state @ tools-sgewebgrid-generic-0902


=== 2018-06-07 ===
=== 2022-03-01 ===
* 11:01 arturo: [[phab:T196137|T196137]] force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`
* 13:41 dcaro: rebooting tools-sgeexec-0916 to clear any state ([[phab:T302702|T302702]])
* 12:11 dcaro: Cleared error state queues for sgeexec-0916 ([[phab:T302702|T302702]])
* 10:23 arturo: tools-sgeexec-0913/0916 are depooled, queue errors. Reboot them and clean errors by hand


=== 2018-06-06 ===
=== 2022-02-28 ===
* 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt ([[phab:T196589|T196589]]) (see the sketch below)
* 08:02 taavi: reboot sgeexec-0916
* 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
* 07:49 taavi: depool tools-sgeexec-0916.tools as it is out of disk space on /
* 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
* 19:04 chasemp: tools-bastion-03 is virtually unusable
* 09:49 arturo: [[phab:T196137|T196137]] aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid
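
The CrashLoopBackOff restarts above were scripted; the broad shape is to list the pods stuck in that state, map each back to its tool, and restart that tool's web service. A sketch assuming each affected namespace corresponds to a tool account named tools.<tool> (the actual script used that day is not preserved here):
<syntaxhighlight lang="bash">
# Find namespaces with a pod in CrashLoopBackOff and restart the web service
# for the corresponding tool. The namespace-to-tool mapping is an assumption.
kubectl get pods --all-namespaces --no-headers \
  | awk '$4 == "CrashLoopBackOff" {print $1}' | sort -u \
  | while read -r ns; do
      sudo -i -u "tools.${ns}" webservice restart
    done
</syntaxhighlight>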


=== 2018-06-05 ===
=== 2022-02-17 ===
* 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben ([[phab:T196486|T196486]])
* 08:23 taavi: deleted tools-clushmaster-02
* 17:39 arturo: [[phab:T196137|T196137]] clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
* 08:14 taavi: made tools-puppetmaster-02 its own client to fix `puppet node deactivate` puppetdb access
* 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs ([[phab:T196486|T196486]])


=== 2018-06-04 ===
=== 2022-02-16 ===
* 10:28 arturo: [[phab:T196006|T196006]] installing sqlite3 package in exec nodes
* 00:12 bd808: Image builds completed.


=== 2018-06-03 ===
=== 2022-02-15 ===
* 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and tools.mbh jobs whose names start with 'comm_delin' or 'delfilexcl' [[phab:T195834|T195834]]
* 23:17 bd808: Image builds failed in buster php image with an apt error. The error looks transient, so starting builds over.
* 23:06 bd808: Started full rebuild of Toolforge containers to pick up webservice 0.81 and other package updates in tmux session on tools-docker-imagebuilder-01
* 22:58 bd808: `sudo apt-get update && sudo apt-get install toolforge-webservice` on all bastions to pick up 0.81
* 22:50 bd808: Built new toollabs-webservice 0.81
* 18:43 bd808: Enabled puppet on tools-proxy-05
* 18:38 bd808: Disabled puppet on tools-proxy-05 for manual testing of nginx config changes
* 18:21 taavi: delete tools-package-builder-03
* 11:49 arturo: invalidate sssd cache in all bastions to debug [[phab:T301736|T301736]]
* 11:16 arturo: purge debian package `unscd` on tools-sgebastion-10/11 for [[phab:T301736|T301736]]
* 11:15 arturo: reboot tools-sgebastion-10 for [[phab:T301736|T301736]]


=== 2018-05-31 ===
=== 2022-02-10 ===
* 11:31 zhuyifei1999_: building & pushing python/web docker image [[phab:T174769|T174769]]
* 15:07 taavi: shutdown tools-clushmaster-02 [[phab:T298191|T298191]]
* 11:13 zhuyifei1999_: force puppet run on tools-worker-1001 to check the impact of https://gerrit.wikimedia.org/r/#/c/433101
* 13:25 wm-bot: trying to join node tools-sgewebgen-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:24 wm-bot: trying to join node tools-sgewebgen-10-1 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:07 wm-bot: trying to join node tools-sgeweblight-10-5 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:06 wm-bot: trying to join node tools-sgeweblight-10-4 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:05 wm-bot: trying to join node tools-sgeweblight-10-3 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 13:03 wm-bot: trying to join node tools-sgeweblight-10-2 to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 12:54 wm-bot: trying to join node tools-sgeweblight-10-1.tools.eqiad1.wikimedia.cloud to the grid cluster in tools. - cookbook ran by arturo@nostromo
* 08:45 taavi: set `profile::base::manage_ssh_keys: true` globally [[phab:T214427|T214427]]
* 08:16 taavi: enable puppetdb and re-enable puppet with puppetdb ssh key management disabled (profile::base::manage_ssh_keys: false) - [[phab:T214427|T214427]]
* 08:06 taavi: disable puppet globally for enabling puppetdb [[phab:T214427|T214427]]


=== 2018-05-30 ===
=== 2022-02-09 ===
* 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
* 19:29 taavi: installed tools-puppetdb-1, not configured on puppetmaster side yet [[phab:T214427|T214427]]
* 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
* 18:56 wm-bot: pooled 10 grid nodes tools-sgeweblight-10-[1-5],tools-sgewebgen-10-[1,2],tools-sgeexec-10-[1-10] ([[phab:T277653|T277653]]) - cookbook ran by arturo@nostromo
* 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close [[phab:T195834|T195834]]
* 18:30 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 18:25 arturo: ignore last message
* 18:24 wm-bot: pooled 9 grid nodes tools-sgeexec-10-[2-10],tools-sgewebgen-[3,15] - cookbook ran by arturo@nostromo
* 14:04 taavi: created tools-cumin-1/toolsbeta-cumin-1 [[phab:T298191|T298191]]


=== 2018-05-28 ===
=== 2022-02-07 ===
* 12:09 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
* 17:37 taavi: generated authdns_acmechief ssh key and stored password in a text file in local labs/private repository ([[phab:T288406|T288406]])
* 12:06 arturo: [[phab:T194665|T194665]] adding mono packages to apt.wikimedia.org for trusty-wikimedia
* 12:52 taavi: updated maintain-kubeusers for [[phab:T301081|T301081]]


=== 2018-05-25 ===
=== 2022-02-04 ===
* 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty [[phab:T195558|T195558]]
* 22:33 taavi: `root@tools-sgebastion-10:/data/project/ru_monuments/.kube# mv config old_config` # experimenting with [[phab:T301015|T301015]]
* 21:36 taavi: clear error state from some webgrid nodes


=== 2018-05-22 ===
=== 2022-02-03 ===
* 11:53 arturo: running puppet to deploy https://gerrit.wikimedia.org/r/#/c/433996/ for [[phab:T194665|T194665]] (mono framework update)
* 09:06 taavi: run `sudo apt-get clean` on login-buster/dev-buster to clean up disk space
* 08:01 taavi: restart acme-chief to force renewal of toolserver.org certificate


=== 2018-05-18 ===
=== 2022-01-30 ===
* 16:36 bd808: Restarted bigbrother on tools-services-02
* 14:41 taavi: created a neutron port with ip 172.16.2.46 for a service ip for toolforge redis automatic failover [[phab:T278541|T278541]] (see the sketch below)
* 14:22 taavi: creating a cluster of 3 bullseye redis hosts for [[phab:T278541|T278541]]
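
Creating a neutron port for the address, as in the 14:41 entry, reserves the IP so it is never handed out to an instance and can instead be moved between the redis hosts during failover. A sketch with the openstack CLI (network and port names are placeholders, not the real ones):
<syntaxhighlight lang="bash">
# Reserve 172.16.2.46 as a dedicated port for the redis failover service IP.
# Network and port names below are placeholders.
openstack port create --network <project-network> \
  --fixed-ip ip-address=172.16.2.46 \
  tools-redis-failover-ip
</syntaxhighlight>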


=== 2018-05-16 ===
=== 2022-01-26 ===
* 21:17 zhuyifei1999_: maintain-kubeusers on stuck in infinite sleeps of 10 seconds
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-10 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-9 - cookbook ran by arturo@nostromo
* 18:33 wm-bot: depooled grid node tools-sgeexec-10-8 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-7 - cookbook ran by arturo@nostromo
* 18:32 wm-bot: depooled grid node tools-sgeexec-10-6 - cookbook ran by arturo@nostromo
* 18:31 wm-bot: depooled grid node tools-sgeexec-10-5 - cookbook ran by arturo@nostromo
* 18:30 wm-bot: depooled grid node tools-sgeexec-10-4 - cookbook ran by arturo@nostromo
* 18:28 wm-bot: depooled grid node tools-sgeexec-10-3 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-2 - cookbook ran by arturo@nostromo
* 18:27 wm-bot: depooled grid node tools-sgeexec-10-1 - cookbook ran by arturo@nostromo
* 13:55 arturo: scaling up the buster web grid with 5 lighttpd and 2 generic nodes ([[phab:T277653|T277653]])


=== 2018-05-15 ===
=== 2022-01-25 ===
* 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414.  It's hanging for unknown reasons.
* 11:50 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
* 11:44 arturo: rebooting buster exec nodes
* 04:05 zhuyifei1999_: Force deletion of grid job {{Gerrit|5221417}} (tools.giftbot sga), host tools-exec-1414 not responding
* 08:34 taavi: sign puppet certificate for tools-sgeexec-10-4


=== 2018-05-12 ===
=== 2022-01-24 ===
* 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop {{!}} [[phab:T194343|T194343]]
* 17:44 wm-bot: reconfiguring the grid by using grid-configurator - cookbook ran by arturo@nostromo
* 15:23 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2018-05-11 ===
=== 2022-01-20 ===
* 14:34 andrewbogott: repooling labvirt1001 tools instances
* 17:05 arturo: drop 9 of the 10 buster exec nodes created earlier. They didn't get DNS records
* 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for [[phab:T194258|T194258]]:  tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 12:56 arturo: scaling up the grid with 10 buster exec nodes ([[phab:T277653|T277653]])


=== 2018-05-10 ===
=== 2022-01-19 ===
* 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update
* 17:34 andrewbogott: rebooting tools-sgeexec-0913.tools.eqiad1.wikimedia.cloud to recover from (presumed) fallout from the scratch/nfs move


=== 2018-05-09 ===
=== 2022-01-14 ===
* 21:11 Reedy: Added Tim Starling as member/admin
* 19:09 taavi: set /var/run/lighttpd as world-writable on all lighttpd webgrid nodes, [[phab:T299243|T299243]]


=== 2018-05-07 ===
=== 2022-01-12 ===
* 21:02 zhuyifei1999_: re-building all docker images [[phab:T190893|T190893]]
* 11:27 arturo: created puppet prefix `tools-sgeweblight`, drop `tools-sgeweblig`
* 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 [[phab:T190893|T190893]]
* 11:03 arturo: created puppet prefix 'tools-sgeweblig'
* 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours
* 11:02 arturo: created puppet prefix 'toolsbeta-sgeweblig'


=== 2018-05-05 ===
=== 2022-01-04 ===
* 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing
* 17:18 bd808: tools-acme-chief-01: sudo service acme-chief restart
* 08:12 taavi: disable puppet & exim4 on [[phab:T298501|T298501]]


==Archives==
* [[Nova Resource:Tools/SAL/Archive 1|Archive 1]] (2013-2014)
* [[Nova Resource:Tools/SAL/Archive 2|Archive 2]] (2015-2017)
* [[Nova Resource:Tools/SAL/Archive 3|Archive 3]] (2018-2019)
* [[Nova Resource:Tools/SAL/Archive 4|Archive 4]] (2020-2021)

=== 2018-05-03 ===
* 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package [[phab:T192566|T192566]]

=== 2018-05-01 ===
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)
 
=== 2018-04-27 ===
* 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
* 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker
 
=== 2018-04-23 ===
* 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools [[phab:T192732|T192732]]
 
=== 2018-04-22 ===
* 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd {{!}} grep -E "    1 " {{!}} grep php-cgi {{!}} xargs sudo kill -9'`
 
=== 2018-04-15 ===
* 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] [[phab:T192224|T192224]]
* 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci [[phab:T192224|T192224]]
 
=== 2018-04-11 ===
* 13:25 chasemp: cleanup exim frozen messages in an effort to alleviate queue pressure
 
=== 2018-04-06 ===
* 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
* 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to [[phab:T159254|T159254]]
* 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for [[phab:T159254|T159254]]
 
=== 2018-04-05 ===
* 18:46 chicocvenancio: killed wget that was hogging io
 
=== 2018-03-29 ===
* 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
* 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done
 
=== 2018-03-28 ===
* 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid
 
=== 2018-03-26 ===
* 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-23 ===
* 23:26 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
* 19:43 bd808: tools-proxy-* Forced puppet run to apply https://gerrit.wikimedia.org/r/#/c/421472/
 
=== 2018-03-22 ===
* 22:04 bd808: Forced puppet run on tools-proxy-02 for [[phab:T130748|T130748]]
* 21:52 bd808: Forced puppet run on tools-proxy-01 for [[phab:T130748|T130748]]
* 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
* 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
 
=== 2018-03-21 ===
* 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
* 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid ([[phab:T190185|T190185]])
 
=== 2018-03-20 ===
* 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) [[phab:T189018|T189018]] [[phab:T190126|T190126]]
 
=== 2018-03-19 ===
* 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools
 
=== 2018-03-16 ===
* 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
* 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp
 
=== 2018-03-15 ===
* 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot [[phab:T185624|T185624]]
 
=== 2018-03-14 ===
* 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 ([[phab:T181531|T181531]])
* 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 ([[phab:T181531|T181531]])
* 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 ([[phab:T181531|T181531]])
* 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
* 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
* 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full
 
=== 2018-03-12 ===
* 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
* 17:13 arturo: [[phab:T188994|T188994]] upgrading packages from `stable`
* 16:53 arturo: [[phab:T188994|T188994]] upgrading packages from stretch-wikimedia
* 16:33 arturo: [[phab:T188994|T188994]] upgrading packages from jessie-wikimedia
* 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 {{Gerrit|5f3561e}} [[phab:T189430|T189430]]
* 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
* 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
* 13:19 arturo: [[phab:T188994|T188994]] upgrade packages from jessie-backports in all jessie servers
* 12:49 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-updates in all ubuntu servers
* 12:34 arturo: [[phab:T188994|T188994]] upgrade packages from trusty-wikimedia in all ubuntu servers
 
=== 2018-03-08 ===
* 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
* 14:02 arturo: [[phab:T188994|T188994]] upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server
 
=== 2018-03-07 ===
* 20:42 chicocvenancio: killed io intensive recursive zip of huge folder
* 18:30 madhuvishy: Killed php-cgi job run by user 51242 on tools-webgrid-lighttpd-1413
* 14:08 arturo: just merged NFS package pinning https://gerrit.wikimedia.org/r/#/c/416943/
* 13:47 arturo: deploying more apt pinnings: https://gerrit.wikimedia.org/r/#/c/416934/
 
=== 2018-03-06 ===
* 16:15 madhuvishy: Reboot tools-docker-registry-02 [[phab:T189018|T189018]]
* 15:50 madhuvishy: Rebooting tools-worker-1011
* 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
* 15:03 arturo: drain and reboot tools-worker-1011
* 15:03 chasemp: rebooted tools-worker 1001-1008
* 14:58 arturo: drain and reboot tools-worker-1010
* 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
* 14:27 chasemp: reboot tools-worker-100[12]
* 14:23 chasemp: downtime icinga alert for k8s workers ready
* 13:21 arturo: [[phab:T188994|T188994]] in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
* 12:58 arturo: [[phab:T188994|T188994]] upgrading packages in jessie nodes from the oldstable source
* 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
* 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did in canary servers last week and it went fine. So run in fleet-wide
* 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic ([[phab:T188911|T188911]])
* 11:33 arturo: removing unused kernel packages in ubuntu nodes
* 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster
 
=== 2018-03-05 ===
* 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
* 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb [[phab:T167026|T167026]] [[phab:T181492|T181492]]
* 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for [[phab:T188911|T188911]]
* 14:01 arturo: deleting old kernel packages in jessie instances for [[phab:T188911|T188911]]
* 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
* 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for [[phab:T187193|T187193]]
* 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for [[phab:T187193|T187193]]
 
=== 2018-03-02 ===
* 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon
 
=== 2018-03-01 ===
* 13:27 arturo: deploy https://gerrit.wikimedia.org/r/#/c/415057/
 
=== 2018-02-27 ===
* 17:37 chasemp: add chico as admin to toolsbeta
* 12:23 arturo: running `apt-get autoclean` in canary servers
* 12:16 arturo: running `apt-get autoremove` in canary servers
 
=== 2018-02-26 ===
* 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
* 10:35 arturo: enable puppet in tools-proxy-01
* 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests
 
=== 2018-02-25 ===
* 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals
 
=== 2018-02-23 ===
* 19:11 arturo: enable puppet in tools-proxy-01
* 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
* 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
* 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded
 
=== 2018-02-22 ===
* 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server
 
=== 2018-02-21 ===
* 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
* 18:15 arturo: puppet should be fine across the fleet
* 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
* 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
* 16:59 arturo: puppet is broken across the cluster due to last change
* 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
* 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
* 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
* 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
* 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tools-logs-02
* 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
* 09:18 chicocvenancio: killed io intensive tool job in bastion
* 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini. Might need to cycle it as well...
 
=== 2018-02-20 ===
* 12:42 arturo: upgrading tools-flannel-etcd-01
* 12:42 arturo: upgrading tools-k8s-etcd-01
 
=== 2018-02-19 ===
* 19:13 arturo: upgrade all packages of tools-services-01
* 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
* 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
* 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration
 
=== 2018-02-16 ===
* 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
* 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
* 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
* 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
* 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
* 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
* 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
* 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y
 
=== 2018-02-15 ===
* 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for [[phab:T187435|T187435]]
* 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
* 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
* 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
* 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
* 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
* 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
 
=== 2018-02-14 ===
* 13:09 arturo: the reboot was OK, the server seems working and kubectl sees all the pods running in the deployment ([[phab:T187315|T187315]])
* 13:04 arturo: reboot tools-paws-master-01 for [[phab:T187315|T187315]]
 
=== 2018-02-11 ===
* 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
* 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
 
=== 2018-02-09 ===
* 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ [[phab:T179343|T179343]] [[phab:T182562|T182562]] [[phab:T186846|T186846]]
* 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
* 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
* 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
* 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
* 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
* 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 ([[phab:T186830|T186830]])
* 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there
 
=== 2018-02-08 ===
* 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
* 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
* 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
* 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
* 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
* 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
* 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
* 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
* 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
* 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
* 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
* 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
* 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
* 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.
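The tools-worker entries above follow the usual one-node-at-a-time maintenance cycle. A minimal sketch, assuming the same hostnames and the locally packaged apt-upgrade wrapper used in the log (run from tools-k8s-master-01 and the worker respectively):
 # On tools-k8s-master-01: stop scheduling onto the node and evict its pods
 sudo kubectl cordon tools-worker-1002.tools.eqiad.wmflabs
 sudo kubectl drain --delete-local-data --force tools-worker-1002.tools.eqiad.wmflabs
 # On the worker itself: apply the pending package upgrades per suite
 sudo apt-upgrade -u upgrade oldstable -v
 sudo apt-upgrade -u upgrade jessie-wikimedia -v
 # Back on the master: return the node to service
 sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs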
 
=== 2018-02-06 ===
* 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
* 13:05 arturo: unpublish/publish trusty-tools repo
* 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for [[phab:T186539|T186539]] after adding it to trusty-tools repo (self contained)
 
=== 2018-02-05 ===
* 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address [[phab:T186539|T186539]]
* 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
* 13:06 arturo: deploying fix for [[phab:T186230|T186230]] using clush
 
=== 2018-02-03 ===
* 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools  python3 ./broken_ref_anchors.py"
 
=== 2018-01-31 ===
* 22:54 chasemp: add bstorm to sudoers as root
 
=== 2018-01-29 ===
* 20:02 chasemp: add zhuyifei1999_ as tools root for [[phab:T185577|T185577]]
* 20:01 chasemp: blast a puppet run to see if any errors are persistent
 
=== 2018-01-28 ===
* 22:49 chicocvenancio: killed compromised session generating miner processes
* 22:48 chicocvenancio: killed miner processes in tools-bastion-03
 
=== 2018-01-27 ===
* 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
* 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive
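A common workaround when git gc is OOM-killed on a large clone like /srv/cdnjs is to bound the repack memory rather than use --aggressive; a sketch only, with illustrative limits (this is not what was run here):
 cd /srv/cdnjs
 # Single-threaded repack with capped delta window memory keeps the OOM killer away
 sudo git -c pack.threads=1 -c pack.windowMemory=256m -c pack.deltaCacheSize=128m gc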
 
=== 2018-01-25 ===
* 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
* 23:20 arturo: [[phab:T179386|T179386]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 05:25 arturo: deploying misctools and jobutils 1.29 for [[phab:T179386|T179386]]
 
=== 2018-01-23 ===
* 19:41 madhuvishy: Add bstorm to project admins
* 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
* 14:17 chasemp: add me, arturo, chico to sudoers and removed marc
 
=== 2018-01-22 ===
* 18:32 arturo: [[phab:T181948|T181948]] [[phab:T185314|T185314]] deploying jobutils and misctools v1.28 in the cluster
* 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
* 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how is the cluster with puppet
* 10:18 arturo: [[phab:T181948|T181948]] deploy misctools 1.27 in the cluster
 
=== 2018-01-19 ===
* 17:32 arturo: [[phab:T185314|T185314]] deploying new version of jobutils 1.27
* 12:56 arturo: the puppet status across the fleet seems good, only minor things like [[phab:T185314|T185314]] , [[phab:T179388|T179388]] and [[phab:T179386|T179386]]
* 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
 
=== 2018-01-18 ===
* 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to [[phab:T182781|T182781]])
* 15:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
* 13:52 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter {{!}} grep lsbdistcodename {{!}} grep trusty && sudo apt-upgrade trusty-wikimedia -v'
* 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
* 12:24 arturo: [[phab:T178717|T178717]] aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
* 12:11 arturo: [[phab:T178717|T178717]] aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
* 11:42 arturo: [[phab:T178717|T178717]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-17 ===
* 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions {{!}} grep upgradeable {{!}} grep trusty-wikimedia' {{!}} tee pending-upgrades-report-trusty-wikimedia.txt
* 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' {{!}} tee pending-upgrades-report.txt
* 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
* 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
* 15:04 andrewbogott: depooling exec-manage tools-exec-1430.  Experimenting with purge-old-kernels
* 14:09 arturo: [[phab:T181647|T181647]] aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'
 
=== 2018-01-16 ===
* 22:01 chasemp: qstat -explain E -xml {{!}} grep 'name' {{!}} sed 's/<name>//' {{!}} sed 's/<\/name>//'  {{!}} xargs qmod -cq
* 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
* 21:24 andrewbogott: repooled tools-exec-1420  and tools-webgrid-lighttpd-1417
* 21:14 andrewbogott: depooling tools-exec-1420  and tools-webgrid-lighttpd-1417
* 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
* 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
* 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412  and tools-exec-1423 for host reboot
* 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413  tools-exec-1442 for host reboot
* 18:50 andrewbogott: switched active proxy back to tools-proxy-02
* 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
* 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
* 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
* 18:26 andrewbogott: repooling tools-exec-1404 and 1434 for host reboot
* 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
* 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
* 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
* 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
* 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
* 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
* 13:35 chasemp: tools-mail: cleared 719 pending messages for almouked@ltnet.net
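The depool/repool pairs above are the standard grid node reboot cycle; a minimal sketch using the exec-manage helper that appears elsewhere in this log, with an illustrative hostname:
 # Drain the node out of the grid before the host reboot
 sudo exec-manage depool tools-exec-1409.tools.eqiad.wmflabs
 # ... reboot the instance (or its labvirt host) and wait for it to come back ...
 # Return it to rotation once it is healthy
 sudo exec-manage repool tools-exec-1409.tools.eqiad.wmflabs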
 
=== 2018-01-11 ===
* 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
* 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
* 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
* 19:00 chasemp: reboot tools-worker-1015
* 15:08 chasemp: reboot tools-exec-1405
* 15:06 chasemp: reboot tools-exec-1404
* 15:06 chasemp: reboot tools-exec-1403
* 15:02 chasemp: reboot tools-exec-1402
* 14:57 chasemp: reboot tools-exec-1401 again...
* 14:53 chasemp: reboot tools-exec-1401
* 14:46 chasemp: install meltdown kernel and reboot workers 1011-1016 as jessie pilot
 
=== 2018-01-10 ===
* 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
* 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
* 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
* 13:57 arturo: [[phab:T184604|T184604]] cleaned stale log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
* 13:46 arturo: [[phab:T184604|T184604]] aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
* 13:45 arturo: [[phab:T184604|T184604]] aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
* 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
* 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
* 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
* 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for [[phab:T184604|T184604]]
* 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened [[phab:T184604|T184604]]
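A sketch of the [[phab:T184604|T184604]] cleanup described above, assuming the runaway files are /var/log/syslog and /var/log/daemon.log as noted in the entries:
 # See what is filling /var/log
 sudo du -sh /var/log/* | sort -h | tail
 # Truncate the runaway files in place so rsyslog keeps writing to the same inode
 sudo truncate -s 0 /var/log/syslog /var/log/daemon.log
 # Force a logrotate pass now that the stale state is gone
 sudo logrotate -f /etc/logrotate.conf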
 
=== 2018-01-09 ===
* 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
* 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
* 23:01 yuvipanda: kill paws master and reboot it
* 22:54 yuvipanda: kill all kube-system pods in paws cluster
* 22:54 yuvipanda: kill all PAWS pods
* 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
* 22:49 yuvipanda: run  clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
* 22:48 yuvipanda: run 'clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash'' to setup kubeadm on all paws worker nodes
* 22:46 yuvipanda: reboot all paws-worker nodes
* 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
* 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
* 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
* 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
* 20:55 chasemp: for n in `kubectl get nodes {{!}} awk '{print $1}' {{!}} grep -v -e tools-worker-1001  -e tools-worker-1016`; do kubectl cordon $n; done
* 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
* 20:15 chasemp: disable puppet on proxies and k8s workers
* 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
* 19:42 chasemp: reboot tools-worker-1010
 
=== 2018-01-08 ===
* 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
* 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02
 
=== 2018-01-06 ===
* 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
* 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)
 
=== 2018-01-05 ===
* 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
* 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
* 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
* 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
* 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
* 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)
 
=== 2018-01-04 ===
* 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of [[phab:T184018|T184018]]
 
=== 2018-01-03 ===
* 15:38 bd808: Forced Puppet run on tools-services-01
* 11:29 arturo: deploy https://gerrit.wikimedia.org/r/#/c/401716/ and https://gerrit.wikimedia.org/r/394101 using clush
 
=== 2017-12-31 ===
* 02:00 bd808: Killed some pwb.py and qacct processes running on tools-bastion-03
 
=== 2017-12-21 ===
* 17:57 bd808: PAWS: deleted hub-deployment pod stuck in crashloopbackoff
* 17:30 bd808: PAWS: deleting hub-deployment pod. Lots of "Connection pool is full" warnings in pod logs
 
=== 2017-12-19 ===
* 21:27 chasemp: reboot tools-paws-master-01
* 18:38 andrewbogott: rebooting tools-paws-master-01
* 05:07 andrewbogott: "service gridengine-master restart" on tools-grid-master
 
=== 2017-12-18 ===
* 12:04 arturo: it seems jupyterhub tries to use a database which doesn't exist: [E 2017-12-18 11:59:49.896 JupyterHub app:904] Failed to connect to db: sqlite:///jupyterhub.sqlite
* 11:58 arturo: The restart didn't work. I could see a lot of log lines in the hub-deployment pod with something like: 2017-12-17 04:08:17,574 WARNING Connection pool is full, discarding connection: 10.96.0.1
* 11:51 arturo: the restart was with: kubectl get pod -o yaml hub-deployment-1381799904-b5g5j -n prod {{!}} kubectl replace --force -f -
* 11:50 arturo: restart pod hub-deployment in paws to try to fix the 502
 
=== 2017-12-15 ===
* 13:55 arturo: same in tools-checker-02.tools.eqiad.wmflabs
* 13:54 arturo: same in tools-exec-1415.tools.eqiad.wmflabs
* 13:52 arturo: running 'sudo puppet agent -t -v' in tools-webgrid-lighttpd-1416.tools.eqiad.wmflabs since didn't update in the last run with clush
 
=== 2017-12-14 ===
* 16:58 arturo: running clush -w @all 'sudo puppet agent --test' from tools-clushmaster-01.eqiad.wmflabs due to https://gerrit.wikimedia.org/r/#/c/394572/ being merged
 
=== 2017-12-13 ===
* 17:37 andrewbogott: upgrading puppet packages on all VMs
* 00:59 madhuvishy: Cordon and Drain tools-worker-1016
* 00:47 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1018-1023, 1025-1027
* 00:34 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1011, 1013-1015, 1017
* 00:28 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1006-1010
* 00:11 madhuvishy: Drain + Cordon, Reboot, Uncordon tools-workers-1002-1005
 
=== 2017-12-12 ===
* 23:29 madhuvishy: rebooting tools-worker-1012
* 18:50 andrewbogott: rebooting tools-worker-1001
 
=== 2017-12-11 ===
* 19:32 bd808: git gc on tools-static-11; --aggressive was killed by system ([[phab:T182604|T182604]])
* 18:07 andrewbogott: upgrading tools puppetmaster to v4
* 17:07 bd808: git gc --aggressive on tools-static-11 ([[phab:T182604|T182604]])
 
=== 2017-12-01 ===
* 15:33 chasemp: stashed the weird mess of untracked files on the tools puppetmaster (they should not be there) to see what breaks
* 15:30 chasemp: prometheus nfs collector on tools-bastion-03
 
=== 2017-11-30 ===
* 23:23 bd808: Hard reboot of tools-bastion-03 via Horizon
* 23:06 chasemp: rebooting login.tools.wmflabs.org due to overload
 
=== 2017-11-20 ===
* 20:34 chasemp: backup crons tools-cron-01:/var/spool/cron# cp -Rp crontabs/ /root/20112017/
* 00:52 andrewbogott: cherry-picking https://gerrit.wikimedia.org/r/#/c/392172/ onto the tools puppetmaster
 
=== 2017-11-17 ===
* 21:33 valhallasw`cloud: also g-w'ed those files, and sent emails to all the affected users
* 21:17 valhallasw`cloud: chmod o-w'ed a bunch of files reported by Dispenser; writing emails to the owners about this
 
=== 2017-11-16 ===
* 17:40 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --enable && sudo puppet agent --test && sudo unattended-upgrades -d'
* 16:50 bd808: Force upgraded nginx on tools-elastic-*
* 16:37 chasemp: reboot tools-checker-01
* 15:17 chasemp: disable puppet
 
=== 2017-11-15 ===
* 22:48 madhuvishy: Rebooted tools-paws-worker-1017
* 15:53 chasemp: reboot bastion-03
* 15:48 chasemp: kill tools.powow on bastion-03 for hammering IO and making bastion unusable
 
=== 2017-11-07 ===
* 01:21 bd808: Removed all non-directory files from /home (via labstore1004 direct access)
 
=== 2017-11-06 ===
* 18:30 bd808: Load on tools-bastion-03 down to 0.72 from 17.47 after killing a bunch of local processes that should have been running on the job grid instead
 
=== 2017-11-05 ===
* 23:48 bd808: Cleaned up 2 huge /tmp files left by tools.croptool (~6.5G)
* 23:44 bd808: Cleaned up 109 files owned by tools.rezabot on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.rezabot -exec rm {} \+`
* 23:37 bd808: Cleaned up 955 files owned by tools.wsexport on tools-webgrid-lighttpd-1428 with `sudo find /tmp -user tools.wsexport -exec rm {} \+`
 
=== 2017-11-03 ===
* 21:19 bd808: Deployed misctools 1.26 ([[phab:T156174|T156174]])
 
=== 2017-11-02 ===
* 16:15 bd808: Restarted nslcd on tools-bastion-03
 
=== 2017-11-01 ===
* 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover [[phab:T179464|T179464]]
* 07:11 madhuvishy: Clear nscd cache across all projects post labsdb dns switchover
 
=== 2017-10-31 ===
* 16:50 bd808: tools-bastion-03 (tools-login, login.tools) is overloaded
 
=== 2017-10-30 ===
* 17:35 madhuvishy: Clear dns caches across tools hosts `sudo nscd -i hosts`
* 16:08 arturo: repool tools-exec-1401.tools.eqiad.wmflabs
* 15:57 arturo: depool again tools-exec-1401.tools.eqiad.wmflabs for more tests related to [[phab:T179024|T179024]]
* 12:47 arturo: repool tools-exec-1401
* 11:58 arturo: depool tools-exec-1401 to test patch in [[phab:T179024|T179024]] --> aborrero@tools-bastion-03:~$ sudo exec-manage depool tools-exec-1401.tools.eqiad.wmflabs
 
=== 2017-10-24 ===
* 18:09 madhuvishy: Disable puppet on tools-package-builder-01 temporarily ([[phab:T178920|T178920]])
* 13:22 chasemp: start admin webservice
* 13:22 chasemp: stop admin webservice
 
=== 2017-10-23 ===
* 14:49 chasemp: wall message and scheduled reboot in 5m for bastion-03
 
=== 2017-10-18 ===
* 21:36 chasemp: stop basebot -- it is failing to log to error.log and spamming email about it. Need to figure out how to notify the maintainer, but it's clearly in a failure loop.
* 14:04 chasemp: add strephit creds to elasticsearch per [[phab:T178310|T178310]]
 
=== 2017-10-12 ===
* 16:57 bd808: Rebuilding all Kubernetes Docker images to include toollabs-webservice 0.38
* 16:53 bd808: Upgraded toollabs-webservice to 0.38
 
=== 2017-10-06 ===
* 15:33 bd808: Upgrade jobutils to 1.25 ([[phab:T177614|T177614]])
* 00:27 bd808: Updated misctools to 1.24
 
=== 2017-10-05 ===
* 22:47 bd808: Updated misctools to 1.23
* 22:42 bd808: Updated jobutils to 1.23
* 15:46 chasemp: tools-bastion-03 has tons of local tools running long lived NFS intensive processes.  I'm rebooting rather than playing whackamole.
 
=== 2017-10-03 ===
* 19:30 bd808: `kubectl --namespace=prod delete pod --all` on tools-paws-master-01
 
=== 2017-10-01 ===
* 21:46 madhuvishy: Cold migrating tools-clushmaster-01 from labvirt1015 to labvirt1017
 
=== 2017-09-29 ===
* 19:49 andrewbogott: migration tools-clushmaster-01 to labvirt1015
 
=== 2017-09-25 ===
* 15:14 andrewbogott: rebooting tools-paws-worker-1006 since I can't access it
* 14:57 chasemp: OS_TENANT_NAME=tools openstack server reboot 2c0cf363-c7c3-42ad-94bd-{{Gerrit|e586f2492321}} (unresponsive)
 
=== 2017-09-20 ===
* 16:52 madhuvishy: apt-get install --only-upgrade apache2; service apache2 restart on tools-puppetmaster-01
 
=== 2017-09-19 ===
* 15:22 chasemp: tools-clushmaster-01:~$ clush -f 5 -g all 'sudo puppet agent --test'
* 13:39 chasemp: bastion-03 someone dropped 8.6G in /tmp which is /not/ seemingly on a temp file system
* 13:25 chasemp: wall Bastion disk is full and needs attention and reboot in 60
 
=== 2017-09-18 ===
* 18:02 bd808: Updated PHP5.6 images for Kubernetes ([[phab:T172358|T172358]])
 
=== 2017-09-13 ===
* 15:34 bd808: Running inbound message purge via clush to @tools-exec
* 15:15 bd808: Running outbound message purge via clush to @tools-exec
* 13:57 bd808: apt-get install nginx-common on tools-static-1[01]
* 13:31 bd808: static down due to apparent nginx package upgrade/config change
* 02:10 bd808: Really disabled puppet on tools-mail
* 01:51 bd808: Nuked all messages in the exim spool on tools-mail
* 01:09 bd808: Removed user WiktCAPT from project
* 00:55 bd808: Archived and then purged /var/spool/exim4/input on tools-mail
* 00:47 bd808: Archived and then purged /var/spool/exim4/msglog on tools-mail
* 00:43 bd808: Stopped exim on tools-mail
* 00:43 bd808: Disabled puppet on tools-mail
* 00:15 chasemp: forced to clean out exim queue as the filesystem used up all inodes
 
=== 2017-08-31 ===
* 20:33 madhuvishy: Updated certs and ran puppet, restarted nginx on tools-proxy-* and tools-static-* ([[phab:T174611|T174611]])
* 20:25 madhuvishy: Merging new cert https://gerrit.wikimedia.org/r/#/c/374873/ ([[phab:T174611|T174611]])
* 20:24 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update ([[phab:T174611|T174611]])
* 20:23 madhuvishy: Disabling puppet on tools-proxy-* and tools-static-* for star.wmflabs.org SSL cert update
 
=== 2017-08-24 ===
* 19:59 bd808: restarted nslcd and nscd on tools-bastion-03
* 19:59 bd808: restarted nslcd and nscd on tools-bastion-02
 
=== 2017-08-22 ===
* 19:20 andrewbogott: deleted tools-puppetmaster-02, it was replaced a month ago by -01
 
=== 2017-08-12 ===
* 18:38 chasemp: restart admin webservice
 
=== 2017-08-11 ===
* 16:09 chasemp: qdel -f -j {{Gerrit|7441503}}
 
=== 2017-08-10 ===
* 14:59 chasemp: 'become stimmberechtigung && restart' && 'become intersect-contribs && restart'
 
=== 2017-08-09 ===
* 17:28 chasemp: webservices restart tools.orphantalk
 
=== 2017-08-03 ===
* 00:47 bd808: tools-bastion-03 not usably responsive to interactive commands; will reboot
* 00:00 bd808: Restarted kube-proxy service on bastion-03
 
=== 2017-08-02 ===
* 16:59 bd808: Force deleted 6 jobs stuck in 'dr' state
 
=== 2017-07-31 ===
* 15:28 chasemp: remove python-keystoneclient from bastion-03
 
=== 2017-07-27 ===
* 23:27 bd808: Killed python procs owned by sdesabbata on tools-login that were stealing all cpu/io
* 21:16 bd808: Disabled puppet on tools-proxy-01 to test nginx proxy config changes
* 16:27 bd808: Enabled puppet on tools-static-11
* 16:10 bd808: Disabled puppet on tools-static-11 to test https://gerrit.wikimedia.org/r/#/c/357878
 
=== 2017-07-26 ===
* 22:33 chasemp: hotpatching an hiera value on tools master to see effects
 
=== 2017-07-20 ===
* 19:48 bd808: Clearing all Eqw state jobs in all queues with: qstat -u '*' {{!}} grep Eqw {{!}} awk '{print $1;}' {{!}} xargs -L1 qmod -cj
* 13:54 andrewbogott: upgrading apache2 on tools-puppetmaster-01
* 04:00 chasemp: tools-webgrid-lighttpd-1402:~# service nslcd restart && service nscd restart
* 03:57 chasemp: tools-exec-1428:~# service nslcd restart && service nscd restart
* 03:57 bd808: Restarted cron, nscd, nslcd on tools-cron-01
* 03:45 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
* 03:44 chasemp: tools-puppetmaster-01:~# service nslcd restart && service nscd restart
* 03:37 bd808: Restarted apache on tools-puppetmaster-01
 
=== 2017-07-19 ===
* 23:52 bd808: Restarted cron on tools-cron-01; toolschecker job showing user not found errors
* 21:19 valhallasw`cloud: Restarted nslcd on tools-bastion-03 (=tools-login); logins seem functional again.
* 21:18 bd808: Forced puppet run and restarted nscd, nslcd on tools-bastion-02
 
=== 2017-07-18 ===
* 19:51 andrewbogott: enabling puppet on tools-proxy-02.  I don't know why it was disabled.
 
=== 2017-07-17 ===
* 01:43 bd808: Uncordoned tools-worker-1020 after it deleted pods with local storage that were filling the entire disk
* 01:36 bd808: Depooling tools-worker-1020
 
=== 2017-07-13 ===
* 21:59 bd808: Elasticsearch cluster upgraded to 5.3.2
* 21:25 bd808: Upgrading ElasticSearch cluster for [[phab:T164842|T164842]]. There will be service interruptions
* 17:59 bd808: Puppet is disabled on tools-proxy-02 with no reason specified.
* 17:09 bd808: Upgraded nginx-common on tools-proxy-02
* 17:05 bd808: Upgraded nginx-common on tools-proxy-01
 
=== 2017-07-12 ===
* 15:46 chasemp: push out puppet run across tools
* 12:15 andrewbogott: restarting 'admin' webservice
 
=== 2017-07-07 ===
* 18:26 bd808: Forced puppet runs on tools-redis-* for security fix
 
=== 2017-07-03 ===
* 04:26 bd808: cdnjs on tools-static-10 is up to date
* 03:38 bd808: cdnjs on tools-static-11 is up to date
* 02:19 bd808: Cleaning up stuck merges for cdnjs clones on tools-static-10 and tools-static-11
 
=== 2017-07-01 ===
* 19:40 bd808: Disabled puppet on tools-k8s-master-01 to try and fix maintain-kubeusers
* 19:32 bd808: Restarted maintain-kubeusers on tools-k8s-master-01
 
=== 2017-06-30 ===
* 01:33 chasemp: time for i in `cat tools-hosts`; do ssh -i ~/.ssh/labs_root_id_rsa root@$i.eqiad.wmflabs 'hostname -f; uptime; tc-setup'; done
* 01:29 andrewbogott: rebooting tools-cron-01
 
=== 2017-06-29 ===
* 23:01 madhuvishy: Uncordoned all k8s-workers
* 20:50 madhuvishy: depooling, rebooting and repooling all grid exec nodes
* 20:36 andrewbogott: depooling, rebooting, and repooling every lighttpd node three at a time
* 19:55 madhuvishy: Killed liangent-php jobs and usrd-tools jobs
* 18:00 madhuvishy: drain cordon reboot uncordon tools-worker-1015
* 17:37 madhuvishy: drain cordon reboot uncordon tools-worker-1005 tools-worker-1007 tools-worker-1008
* 17:22 bd808: rebooting tools-static-11
* 17:20 andrewbogott: rebooting tools-static-10
* 17:20 madhuvishy: drain cordon reboot uncordon tools-worker-1012 tools-worker-1003
* 17:13 madhuvishy: drain cordon reboot uncordon tools-worker-1022, tools-worker-1009, tools-worker-1002
* 16:27 chasemp: restart k8s components on master (madhu)
* 16:10 chasemp: tools-flannel-etcd-01:~$ sudo service etcd restart
* 16:04 madhuvishy: reboot tools-worker-1022 tools-worker-1009
* 15:57 chasemp: reboot tools-docker-registry-01 for nfs
 
=== 2017-06-27 ===
* 21:32 andrewbogott: moving all tools nodes to new puppetmaster, tools-puppetmaster-01.tools.eqiad.wmflabs
 
=== 2017-06-25 ===
* 15:13 madhuvishy: Restarted webservice on tools.fatameh
 
=== 2017-06-24 ===
* 16:01 bd808: Created and provisioned elasticsearch password for tools.wmde-uca-test ([[phab:T167971|T167971]])
 
=== 2017-06-23 ===
* 20:20 bd808: Reindexing various elasticsearch indexes created before we upgraded to v2.x
* 20:19 bd808: Dropped garbage indexes in elasticsearch cluster
 
=== 2017-06-22 ===
* 17:03 bd808: Rolled back attempt at Elasticsearch upgrade. Indices need to be rebuilt with 2.x before 5.x can be installed. [[phab:T164842|T164842]]
* 16:19 bd808: Backed up elasticsearch indexes to personal laptop using elasticdump incase [[phab:T164842|T164842]] goes horribly wrong
* 00:12 bd808: Set ownership and permissions on $HOME/.kube for all tools ([[phab:T165875|T165875]])
 
=== 2017-06-21 ===
* 17:43 andrewbogott: repooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 17:42 madhuvishy: Restarted webservice for openstack-browser
* 17:36 andrewbogott: depooling tools-exec-1412, 1415, 1417, 1420, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
* 17:35 andrewbogott: repooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
* 17:24 andrewbogott: depooling tools-exec-1411, 1416, 1418, 1424, tools-webgrid-lighttpd-1404, 1410
* 17:23 andrewbogott: repooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
* 17:11 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, 1409, 1411, 1418, 1420, 1425
* 17:10 andrewbogott: repooling tools-webgrid-lighttpd-1412, tools-exec-1423
* 16:57 andrewbogott: depooling tools-webgrid-lighttpd-1412, tools-exec-1423
* 16:53 andrewbogott: repooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
* 16:52 andrewbogott: repooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 16:35 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428, tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
* 16:29 andrewbogott: depooling tools-exec-1413, 1442, tools-webgrid-lighttpd-1417, 1419, 1421, 1427, 1428
* 16:05 godog: delete pods for lolrrit-wm to force restart
* 15:45 andrewbogott: repooling tools-exec-1422, tools-webgrid-lighttpd-1413
* 15:41 andrewbogott: switching the proxy ip back to tools-proxy-02
* 15:31 andrewbogott: temporarily pointing the tools-proxy IP to tools-proxy-01
* 15:26 andrewbogott: depooling tools-exec-1422, tools-webgrid-lighttpd-1413
* 15:12 andrewbogott: depooling tools-exec-1404, tools-exec-1434, tools-worker-1026
* 15:10 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 14:53 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
* 14:52 andrewbogott: repooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
* 14:37 andrewbogott: depooling tools-exec-1403, tools-exec-gift-trusty-01, tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403
* 14:32 andrewbogott: repooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
* 14:20 andrewbogott: depooling tools-exec-1405, tools-exec-1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, tools-webgrid-lighttpd-1405
* 14:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
* 13:56 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1440, 1441, tools-webgrid-lighttpd-1402, tools-webgrid-lighttpd-1407
 
=== 2017-06-14 ===
* 22:09 bd808: Restarted apache2 proc on tools-puppetmaster-02
 
=== 2017-06-08 ===
* 18:14 madhuvishy: Also delete from /tmp on tools-webgrid-lighttpd-1411 xvfb-run.*, calibre_* and ws-*.epub
* 18:10 madhuvishy: Delete ws-*.epub from /tmp on tools-webgrid-lighttpd-1426
* 18:07 madhuvishy: Clean up space on /tmp on tools-webgrid-lighttpd-1426 by deleting temp files xvfb-run.* and calibre_1.25.0_tmp_* created by the wsexport tool
 
=== 2017-06-07 ===
* 19:05 madhuvishy: Killed scp job run by user torin8 on tools-bastion-02
 
=== 2017-06-06 ===
* 20:30 chasemp: rebooting tools-bastion-02 as unresponsive (up 76 days and lots of seemingly left behind things running)
 
=== 2017-06-05 ===
* 23:44 bd808: Deleted tools.iabot crontab that somehow got locally installed on tools-exec-1412 on 2017-05-24T20:55Z
* 22:15 bd808: Deleted tools.aibot crontab that somehow got locally installed on tools-exec-1436 on 2017-05-24T20:55Z
* 19:55 andrewbogott: disabling puppet on tools-proxy-01 and -02 for a staged rollout of https://gerrit.wikimedia.org/r/#/c/350494/16
 
=== 2017-06-01 ===
* 15:15 andrewbogott: depooling/rebooting/repooling tools-exec-1403 as part of old kernel-purge testing
 
=== 2017-05-31 ===
* 19:29 bd808: Rebuilding all Docker images to pick up toollabs-webservice v0.37 ([[phab:T163355|T163355]])
* 19:24 bd808: Updating toollabs-webservice package via clush ([[phab:T163355|T163355]])
* 19:16 bd808: Installed toollabs-webservice_0.37_all.deb from local file on tools-bastion-02 ([[phab:T163355|T163355]])
* 16:34 andrewbogott: running 'apt-get -yq autoremove' env='{DEBIAN_FRONTEND: "noninteractive"}' on all instances with salt
* 16:25 andrewbogott: rebooting tools-exec-1404 as part of a disk-space-saving test
* 14:07 andrewbogott: migrating tools-exec-1409 to labvirt1009 to reduce CPU load on labvirt1006 ([[phab:T165753|T165753]])
 
=== 2017-05-30 ===
* 22:32 andrewbogott: migrating tools-webgrid-lighttpd-1406, tools-exec-1410  from labvirt1006 to labvirt1009 to balance cpu usage
* 18:15 andrewbogott: restarted robokobot virgule to free up leaked files
* 17:36 andrewbogott: restarting excel2wiki to clean up file leaks
* 17:36 andrewbogott: restarting idwiki-welcome in kenrick95bot to free up leaked files
* 17:31 andrewbogott: restarting onetools to clean up file leaks
* 17:29 andrewbogott: restarting ytcleaner webservice to clean up leaked files
* 17:22 andrewbogott: restarting vltools to clean up leaked files
* 17:20 madhuvishy: Uncordoned tools-worker-1006
* 17:16 madhuvishy: Killed tool videoconvert on tools-exec-1440 in debugging labstore disk space issues
* 17:15 madhuvishy: Drained and rebooted tools-worker-1006
* 17:15 andrewbogott: restarted croptool to clean up stray files
* 17:15 madhuvishy: depooled, rebooted, and repooled tools-exec-1412
* 17:15 andrewbogott: restarted catmon tool to clean up stray files
 
=== 2017-05-26 ===
* 20:32 bd808: Added tools-webgrid-lighttpd-14{19,2[0-8]} as submit hosts
* 20:31 bd808: Added tools-webgrid-lighttpd-1412 and tools-webgrid-lighttpd-1413 as submit hosts
* 20:28 bd808: sudo qconf -as tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs
 
=== 2017-05-22 ===
* 07:49 chasemp: move ooooold shared resources into archive for later cleanup
 
=== 2017-05-20 ===
* 09:27 madhuvishy: Truncating jerr.log for tool videoconvert since it's 967GB
 
=== 2017-05-10 ===
* 19:11 bd808: Edited striker db record for user Stepan Grigoryev to detach SUL and Phab accounts. [[phab:T164849|T164849]]
* 17:47 bd808: Signed and revoked puppet certs generated when our DNS flipped out and gave hosts non-FQDN hostnames
* 17:29 bd808: Fixed broken puppet cert on tools-package-builder-01
 
=== 2017-05-04 ===
* 19:23 madhuvishy: Rebooting tools-grid-shadow
* 16:21 madhuvishy: Start instance tools-grid-master.tools from horizon
* 16:20 madhuvishy: Shut off tools-grid-master.tools instance from horizon
* 16:16 madhuvishy: Stopped gridengine-shadow on tools-grid-shadow.tools (service gridengine-shadow stop and kill -9 individual shadowd processes)
 
=== 2017-04-24 ===
* 15:33 bd808: Removed Gergő Tisza as a projectadmin for [[phab:T163611|T163611]]; event done
 
=== 2017-04-21 ===
* 22:30 bd808: Added Gergő Tisza as a projectadmin for [[phab:T163611|T163611]]
* 13:43 chasemp: [[phab:T161898|T161898]] clush -g all 'sudo puppet agent --disable "rollout nfs-mount-manager"'
 
=== 2017-04-20 ===
* 17:15 bd808: Deleted shutdown VM tools-docker-builder-04; tools-docker-builder-05 is the new hotness
* 17:11 bd808: kill -INT 19897 on tools-proxy-02 to stop a hung nginx child process left from the last graceful restart of nginx
 
=== 2017-04-19 ===
* 15:10 bd808: apt-get install psmisc on tools-proxy-0[12]
* 13:23 chasemp: stop docker on tools-proxy-01
* 13:20 chasemp: clean up disk space on tools-proxy-01
 
=== 2017-04-18 ===
* 20:37 bd808: Restarted bigbrother on tools-services-02
* 04:23 bd808: Shutdown tools-docker-builder-04; will wait a bit before deleting
* 04:04 bd808: Built and pushed new Docker images based on {{Gerrit|82a46b4}} (Refactor apt-get actions in Dockerfiles)
* 03:42 bd808: Made tools-docker-builder-05.tools.eqiad.wmflabs the active docker build host
* 01:01 bd808: Built instance tools-package-builder-01
 
=== 2017-04-17 ===
* 20:41 bd808: Building tools-docker-builder-05
* 19:35 chasemp: add reedy to sudo all perms so he can admin things
* 17:21 andrewbogott: adding 8 more exec nodes:  tools-exec-1435 through 1442
 
=== 2017-04-11 ===
* 16:46 andrewbogott: added exec nodes tools-exec-1430, 31, 32, 33, 34.
* 14:15 andrewbogott: emptied /srv/pbuilder to make space on tools-docker-04
* 02:35 bd808: Restarted maintain-kubeusers on tools-k8s-master-01
 
=== 2017-04-03 ===
* 13:48 chasemp: enable puppet on gridmaster
 
=== 2017-04-01 ===
* 15:28 andrewbogott: added five new exec nodes, tools-exec-1425 through 1429
* 14:26 chasemp: up nfs thresholds https://gerrit.wikimedia.org/r/#/c/345975/
* 14:00 chasemp: disable puppet on tools-grid-master
* 13:52 chasemp: tools-grid-master tc-setup clean
* 13:40 chasemp: restart nscd and nslcd on tools-grid-master
* 13:31 chasemp: reboot tools-exec-1420
 
=== 2017-03-31 ===
* 22:25 yuvipanda: apt-get update && apt-get install kubernetes-node on tools-proxy-01 to upgrade kube-proxy systemd service unit
 
=== 2017-03-30 ===
* 20:29 chasemp: stop grid-master temporarily & umount -fl project nfs  & remount & start grid-master
* 17:38 chasemp: reboot tools-exec-1401
* 17:30 madhuvishy: Updating tools project hiera config to add role::labs::nfsclient::lookupcache: all via Horizon ([[phab:T136712|T136712]])
* 17:29 madhuvishy: Disabled puppet across tools in prep for [[phab:T136712|T136712]]
 
=== 2017-03-27 ===
* 04:06 andrewbogott: erasing random log files on tools-proxy-01 to avoid filling the disk
 
=== 2017-03-23 ===
* 20:38 andrewbogott: migrating tools-exec-1401 to labvirt1001
* 19:56 andrewbogott: migrating tools-exec-1408 to labvirt1001
* 19:02 andrewbogott: migrating tools-exec-1407 to labvirt1001
* 16:37 andrewbogott: migrating tools-webgrid-lighttpd-1402 and 1407 to labvirt1001 (testing labvirt1001 and easing CPU load on labvirt1010)
 
=== 2017-03-22 ===
* 13:48 andrewbogott: migrating tools-bastion-02 in 15 minutes
 
=== 2017-03-21 ===
* 17:06 andrewbogott: moving tools-webgrid-lighttpd-1404 to labvirt1012 to ease pressure on labvirt1004
* 16:19 andrewbogott: moving tools-exec-1406 to labvirt1011 to ease CPU usage on labvirt1004
 
=== 2017-03-20 ===
* 22:47 yuvipanda: disable puppet on all k8s workers to test https://gerrit.wikimedia.org/r/#/c/343708/
* 18:36 bd808: Applied openstack::clientlib on tools-checker-02 and forced puppet run
* 18:03 bd808: Applied openstack::clientlib on tools-checker-01 and forced puppet run
* 17:31 andrewbogott: migrating tools-exec-1417 to labvirt1013
* 17:05 andrewbogott: migrating tools-webgrid-lighttpd-1410 to labvirt1012 to reduce load on labvirt1001
* 16:42 andrewbogott: migrating tools-webgrid-generic-1404 to labvirt1011 to reduce load on labvirt1001
* 16:13 andrewbogott: migrating tools-exec-1408 to labvirt1010 to reduce load on labvirt1001
 
=== 2017-03-17 ===
* 17:24 andrewbogott: moving tools-webgrid-lighttpd-1416 to labvirt1013 to reduce load on labvirt1004
* 17:15 andrewbogott: moving tools-exec-1424 to labvirt1012 to ease load on labvirt1004
 
=== 2017-03-15 ===
* 19:21 andrewbogott: added new exec nodes:  tools-exec-1421 and tools-exec-1422
* 17:42 madhuvishy: Restarted stashbot
* 17:29 chasemp: docker stop && rm -fR /var/lib/docker/* on worker-1001
* 17:20 chasemp: test of logging
* 16:11 chasemp: k8s master 'for h in `kubectl get nodes {{!}} grep worker {{!}} grep -v NotReady {{!}} grep -v Disabled {{!}} awk '{print $1}'`; do echo $h && kubectl drain --delete-local-data --force $h && sleep 10 ; done'
* 16:08 chasemp: stop puppet on k8s master and drain nodes
* 15:50 chasemp: (late) kill what appears to be an android emulator? unsure but it's eating all IO
 
=== 2017-03-14 ===
* 21:24 bd808: Deleted tools-precise-dev ([[phab:T160466|T160466]])
* 21:13 bd808: Removed non-existent tools-submit.eqiad.wmflabs from submit hosts list
* 21:02 bd808: Deleted tools-exec-gift ([[phab:T160461|T160461]])
* 20:45 bd808: Deleted tools-webgrid-lighttpd-12* nodes ([[phab:T160442|T160442]])
* 20:29 bd808: Deleted tools-exec-12* nodes ([[phab:T160457|T160457]])
* 20:27 bd808: Disassociated floating IPs from tools-exec-12* nodes ([[phab:T160457|T160457]])
* 17:41 madhuvishy: Hand fix tools-puppetmaster by removing the old mariadb submodule directory
* 17:23 madhuvishy: Remove role::toollabs::precise_reminder from tools-bastion-03
* 15:40 bd808: Installing toollabs-webservice 0.36 across cluster using clush
* 15:36 bd808: Upgraded toollabs-webservice to 0.36 on tools-bastion-02.tools
* 15:25 bd808: Installing jobutils 1.21 across cluster using clush
* 15:23 bd808: Installed jobutils 1.21 on tools-bastion-02
* 15:03 bd808: Shutting down webservices running on Precise job grid nodes
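Retiring grid nodes as above also means removing them from the gridengine configuration; a sketch with illustrative hostnames, using the counterparts of the qconf -as command seen elsewhere in this log:
 # Drop a decommissioned host from the submit host list
 sudo qconf -ds tools-submit.eqiad.wmflabs
 # Remove an exec node from the @general hostgroup, then from the exec host list
 sudo qconf -dattr hostgroup hostlist tools-exec-1201.tools.eqiad.wmflabs @general
 sudo qconf -de tools-exec-1201.tools.eqiad.wmflabs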
 
=== 2017-03-13 ===
* 21:12 valhallasw`cloud: tools-bastion-03: killed heavy unzip operation from staeiou, and heavy (inadvertent large file opening?) vim operation from steenth, as the entire server was blocked due to high i/o
 
=== 2017-03-07 ===
* 17:59 andrewbogott: depooling, migrating tools-exec-1416 as part of ongoing labvirt1001 issues
* 17:21 madhuvishy: tools-webgrid-lighttpd-1409 migrated to labvirt1011 and repooled
* 16:31 madhuvishy: Depooled tools-webgrid-lighttpd-1409 for cold migrating to different labvirt
 
=== 2017-03-06 ===
* 22:52 andrewbogott: migrating tools-webgrid-lighttpd-1411 to labvirt1011 to give labvirt1001 a break
* 19:03 madhuvishy: Stopping webservice running on tool tree-of-life on author request
* 18:25 yuvipanda: set complex_values        slots=300,release=trusty  for tools-exec-gift-trusty-01.tools.eqiad.wmflabs
 
=== 2017-03-04 ===
* 23:47 madhuvishy: Added new k8s workers 1028, 1029
 
=== 2017-02-28 ===
* 03:52 scfc_de: Deployed jobutils and misctools 1.20/1.20~precise+1 ([[phab:T158722|T158722]]).
 
=== 2017-02-27 ===
* 02:42 scfc_de: Purged misctools from instances where not puppetized.
* 02:42 scfc_de: Deployed jobutils and misctools 1.19/1.19~precise+1 ([[phab:T155787|T155787]], [[phab:T156886|T156886]]).
 
=== 2017-02-17 ===
* 12:51 chasemp: create tools-exec-gift-trusty-01
* 12:40 chasemp: create tools-exec-gift-trusty
* 12:24 chasemp: mass apt-get clean and removal of some old .gz log files due to 30+ low space warnings
 
=== 2017-02-15 ===
* 18:45 yuvipanda: clush a restart of nscd across all of tools
* 00:01 bd808: Rebuilt python and python2 Docker images ([[phab:T157744|T157744]])
 
=== 2017-02-08 ===
* 06:22 yuvipanda: drain tools-worker-1026 for docker upgrade
* 05:28 yuvipanda: drain pods from tools-worker-1027.tools.eqiad.wmflabs for docker upgrade
* 05:28 yuvipanda: disable puppet on all k8s nodes in preparation for docker upgrade
 
=== 2017-02-07 ===
* 13:49 scfc_de: Deployed toollabs-webservice_0.33_all.deb ([[phab:T156605|T156605]], [[phab:T156626|T156626]]).
* 13:49 scfc_de: Deployed tools-manifest_0.11_all.deb.
 
=== 2017-02-04 ===
* 02:13 yuvipanda: launch tools-worker-1027 to see if puppet works fine on first run!
* 02:13 yuvipanda: reboot tools-worker-1026 to see if it comes up fine
* 01:46 yuvipanda: launch tools-worker-1026
 
=== 2017-02-03 ===
* 21:34 madhuvishy: Migrated over precise tools to trusty for user multichill (catbot, family, locator, multichill, nlwikibots, railways,  wlmtrafo, wikidata-janitor)
* 21:13 chasemp: reboot tools-bastion-03 as unresponsive
 
=== 2017-02-02 ===
* 20:39 yuvipanda: import docker-engine 1.11.2 (currently running version) and 1.12.6 (latest version) into aptly
* 00:06 madhuvishy: Remove user maximilianklein from tools.cite-o-meter (on request)
 
=== 2017-01-30 ===
* 20:25 yuvipanda: sudo ln -s /usr/bin/kubectl /usr/local/bin/kubectl to temporarily fix webservice shell not working
 
=== 2017-01-27 ===
* 19:22 chasemp: reboot tools-bastion-02 as it is having issues
* 02:01 madhuvishy: Reenabled puppet on tools-checker-01
* 00:29 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
 
=== 2017-01-26 ===
* 23:37 madhuvishy: reenabled puppet on tools-checker
* 23:02 madhuvishy: Disabling puppet on tools-checker instances to test https://gerrit.wikimedia.org/r/#/c/334433/
* 16:08 chasemp: major cleanup for stale var items on tools-exec-1221
 
=== 2017-01-24 ===
* 18:14 andrewbogott: one last reboot of tools-mail
* 18:00 andrewbogott: apt-get autoremove on tools-mail
* 17:51 andrewbogott: rebooting tools-mail post upgrade
* 17:19 andrewbogott: restarting tools-mail, beginning do-release-upgrade -d -q
* 17:17 andrewbogott: backing up tools-mail to ~root/8c499e6e-1b79-4bb1-8f7f-72fee1f74ea5-backup on labvirt1009
* 17:15 andrewbogott: stopping tools-mail, backing up, upgrading from precise to trusty
* 15:49 yuvipanda: clush -g all 'sudo rm /usr/local/bin/kube*' to get rid of old kube related binaries
* 14:42 yuvipanda: re-enable puppet on tools-proxy-01, test success on proxy-02
* 14:37 yuvipanda: disable puppet on tools-proxy-01 (active proxy) to check deploying debianized kube-proxy on proxy-02
* 13:52 yuvipanda: upgrading k8s on worker nodes to use debs + new k8s version
* 13:52 yuvipanda: finished upgrading k8s + using debs
* 12:49 yuvipanda: purge ancient kubectl, kube-apiserver, kube-controller-manager, kube-scheduler packages from tools-k8s-master-01, these were my old terrible packages
 
=== 2017-01-23 ===
* 19:36 andrewbogott: temporarily shutting down tools-webgrid-lighttpd-1201
* 19:35 yuvipanda: depool tools-webgrid-lighttpd-1201 for snapshotting tests
* 17:13 chasemp: reboot tools-exec-1411 as having serious transient issues
 
=== 2017-01-20 ===
* 15:58 yuvipanda: enabling puppet across all hosts
* 15:36 yuvipanda: disable puppet everywhere to cherrypick patch moving base to a profile
* 00:50 bd808: sudo qdel -f {{Gerrit|1199218}} to force delete a stuck toolschecker job
 
=== 2017-01-17 ===
* 18:47 madhuvishy: Reenabled puppet across tools
* 18:26 madhuvishy: Disabling puppet across tools to test https://gerrit.wikimedia.org/r/#/c/329707/
 
=== 2017-01-11 ===
* 22:09 chasemp: add Reedy to admin in tool labs (approved by bryon and chase for access to investigate specific tool abuse behavior)
 
=== 2017-01-10 ===
* 19:05 madhuvishy: Killed 3 jobs from tools.arnaub that were causing high load on tools-exec-1411
 
=== 2017-01-06 ===
* 19:02 bd808: Terminated deprecated instances tools-exec-121[2-6] ([[phab:T154539|T154539]])
 
=== 2017-01-04 ===
* 02:43 madhuvishy: Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. [[phab:T152369|T152369]]
 
=== 2017-01-03 ===
* 23:56 bd808: Removed tools-exec-12[12-16] from gridengine ([[phab:T154539|T154539]])
* 23:27 bd808: drained tools-exec-1216 ([[phab:T154539|T154539]])
* 23:26 bd808: drained tools-exec-1215 ([[phab:T154539|T154539]])
* 23:25 bd808: drained tools-exec-1214 ([[phab:T154539|T154539]])
* 23:25 bd808: drained tools-exec-1213 ([[phab:T154539|T154539]])
* 23:24 bd808: drained tools-exec-1212 ([[phab:T154539|T154539]])
* 23:11 madhuvishy: Disabled puppet on tools-checker-01 ([[phab:T152369|T152369]])
* 21:43 madhuvishy: Adding iptables rule to drop incoming connections from toolschecker on labservices1001
* 20:51 madhuvishy: Adding iptables rule to block outgoing connections to labservices1001 on tools-checker-01
* 20:43 madhuvishy: Silenced tools checker on icinga to test labservices1001 failure causing toolschecker to flake out  [[phab:T152369|T152369]]
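The failure simulation above amounts to a pair of DROP rules; a sketch with documentation-range addresses standing in for the real IPs, which the log does not record:
 # On tools-checker-01: block outgoing traffic to labservices1001 (address illustrative)
 sudo iptables -A OUTPUT -d 192.0.2.10 -j DROP
 # On labservices1001: drop incoming traffic from the toolschecker host (address illustrative)
 sudo iptables -A INPUT -s 192.0.2.20 -j DROP
 # Revert afterwards by repeating each rule with -D in place of -A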
 
=== 2016-12-25 ===
* 00:28 yuvipanda: comment out cron running 'clean' script of avicbot every minute without -once
* 00:28 yuvipanda: force delete all jobs of avicbot
* 00:25 yuvipanda: delete all jobs of avicbot. This is 419 jobs
* 00:20 yuvipanda: kill clean.sh process of avicbot
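Force-deleting every job owned by one tool, as above, can be done per user; a sketch assuming the tool's grid account is tools.avicbot:
 # Force-delete all jobs owned by the tool's grid account
 qdel -f -u tools.avicbot
 # Equivalent pipeline in the style used elsewhere in this log
 qstat -u tools.avicbot | awk 'NR > 2 {print $1}' | xargs -r qdel -f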
 
=== 2016-12-19 ===
* 20:07 valhallasw`cloud: killed gps_exif_bot2.py (tools.gpsexif), was using 50MB/s io, lagging all of tools-bastion-03
* 13:06 yuvipanda: run  /usr/local/bin/deploy-master http://tools-docker-builder-03.tools.eqiad.wmflabs v1.3.3wmf1 on tools-k8s-master-01
* 12:53 yuvipanda: cleaned out pbuilder from tools-docker-builder-01 to clean up
 
=== 2016-12-17 ===
* 04:49 yuvipanda: turned on lookupcache again for bastions
 
=== 2016-12-15 ===
* 18:52 yuvipanda: reboot tools-exec-1204
* 18:49 yuvipanda: reboot tools-webgrid-lighttpd-12[01-05]
* 18:45 yuvipanda: reboot tools-exec-gift
* 18:41 yuvipanda: reboot tools-exec-1217 to 1221
* 18:30 yuvipanda: rebooted tools-exec-1212 to 1216
* 14:55 yuvipanda: reboot tools-services-01
 
=== 2016-12-14 ===
* 18:43 mutante: tools-bastion-03 - ran 'locale-gen ko_KR.EUC-KR' for [[phab:T130532|T130532]]
 
=== 2016-12-13 ===
* 20:54 chasemp: reboot bastion-03 as unresponsive
 
=== 2016-12-09 ===
* 19:32 godog: upgrade / restart prometheus-node-exporter
* 08:37 YuviPanda: run delete-dbusers and force replica.my.cnf creation for all tools that did not have it
 
=== 2016-12-08 ===
* 18:48 YuviPanda: restarted toolschecker on tools-checker-01
 
=== 2016-12-07 ===
* 09:45 YuviPanda: restart redis on tools-proxy-02
* 09:32 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/324210 and https://gerrit.wikimedia.org/r/324211
* 09:29 YuviPanda: clush -g k8s-worker -g k8s-master -g webproxy -b 'sudo puppet agent --disable "Deploying k8s change with alex"'
 
=== 2016-12-06 ===
* 00:36 bd808: Updated toollabs-webservice to 0.31 on rest of cluster ([[phab:T147350|T147350]])
 
=== 2016-12-05 ===
* 23:19 bd808: Updated toollabs-webservice to 0.31 on tools-bastion-02 ([[phab:T147350|T147350]])
* 22:55 bd808: Updated jobutils to 1.17 on tools-mail ([[phab:T147350|T147350]])
* 22:53 bd808: Updated jobutils to 1.17 on tools-precise-dev ([[phab:T147350|T147350]])
* 22:53 bd808: Updated jobutils to 1.17 on tools-cron-01 ([[phab:T147350|T147350]])
* 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-03 ([[phab:T147350|T147350]])
* 22:52 bd808: Updated jobutils to 1.17 on tools-bastion-02 ([[phab:T147350|T147350]])
* 16:53 bd808: Terminated deprecated instances: "tools-exec-1201", "tools-exec-1202", "tools-exec-1203", "tools-exec-1205", "tools-exec-1206", "tools-exec-1207", "tools-exec-1208", "tools-exec-1209", "tools-exec-1210", "tools-exec-1211" ([[phab:T151980|T151980]])
* 16:50 bd808: Released floating IPs from decommissioned tools-exec-12[01-11] instances
 
=== 2016-11-30 ===
* 23:06 bd808: Removed tools-exec-12[00-11] from gridengine ([[phab:T151980|T151980]])
* 22:54 bd808: Removed tools-exec-12[00-11] from @general hostgroup
* 15:17 chasemp: restart coibot 'coibot.sh -o syslog.output -e syslog.errors -r yes'
* 05:20 bd808: rescheduled continuous jobs on tools-exec-1210; 2 task queue jobs remain ([[phab:T151980|T151980]])
* 05:18 bd808: drained tools-exec-1211 ([[phab:T151980|T151980]])
* 05:14 bd808: drained tools-exec-1209 ([[phab:T151980|T151980]])
* 05:13 bd808: drained tools-exec-1208 ([[phab:T151980|T151980]])
* 05:12 bd808: drained tools-exec-1207 ([[phab:T151980|T151980]])
* 05:10 bd808: drained tools-exec-1206 ([[phab:T151980|T151980]])
* 05:07 bd808: drained tools-exec-1205 ([[phab:T151980|T151980]])
* 05:04 bd808: drained tools-exec-1204 ([[phab:T151980|T151980]])
* 05:00 bd808: drained tools-exec-1203 ([[phab:T151980|T151980]])
* 05:00 bd808: drained tools-exec-1202 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1211 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1210 ([[phab:T151980|T151980]])
* 04:58 bd808: disabled queues on tools-exec-1209 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1208 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1207 ([[phab:T151980|T151980]])
* 04:57 bd808: disabled queues on tools-exec-1206 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1205 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1204 ([[phab:T151980|T151980]])
* 04:56 bd808: disabled queues on tools-exec-1203 ([[phab:T151980|T151980]])
* 04:55 bd808: disabled queues on tools-exec-1202 ([[phab:T151980|T151980]])
* 04:52 bd808: drained tools-exec-1201 ([[phab:T151980|T151980]])
* 04:48 bd808: draining tools-exec-1201
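Draining a node as above means disabling its queue instances and then rescheduling the jobs still running there; a sketch with an illustrative hostname, in the same pipeline style the log already uses:
 # Disable every queue instance on the node so nothing new lands there
 qmod -d '*@tools-exec-1211.tools.eqiad.wmflabs'
 # Reschedule the continuous jobs still running on it onto other nodes
 qstat -u '*' | grep tools-exec-1211 | awk '{print $1}' | xargs -r qmod -rj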
 
=== 2016-11-29 ===
* 13:43 hashar: updating jouncebot so it properly reclaims its nick ( [[phab:T150916|T150916]] https://gerrit.wikimedia.org/r/#/c/324025/ )
 
=== 2016-11-22 ===
* {{SAL entry|1=15:13 chasemp: readd attr +i to replica.my.cnf that seems to have gotten lost in rsync migration}}
 
=== 2016-11-21 ===
* {{SAL entry|1=21:15 YuviPanda: disable puppet everywhere}}
* {{SAL entry|1=19:49 YuviPanda: restart all webservice jobs on gridengine to pick up logging again}}
 
=== 2016-11-20 ===
* {{SAL entry|1=06:51 Krenair: ran `qmod -rj lighttpd-admin` as tools.admin to try to get the main page back up, it worked briefly but then broke again}}
 
=== 2016-11-16 ===
* {{SAL entry|1=20:14 yuvipanda: upgrade toollabs-webservice to 0.30 on all webgrid nodes}}
* {{SAL entry|1=18:31 chasemp: reboot tools-exec-1404 (already depooled)}}
* {{SAL entry|1=18:19 chasemp: reboot tools-exec-1403}}
* {{SAL entry|1=17:23 chasemp: reboot tools-exec-1212 (converted via 321786 testing for recovery on boot)}}
* {{SAL entry|1=16:55 chasemp: clush -g all "puppet agent --disable 'trail run for changeset 321786 handling /var/lib/gridengine'"}}
* {{SAL entry|1=02:05 yuvipanda: rebooting tools-docker-registry-01, can't ssh in}}
* {{SAL entry|1=01:43 yuvipanda: cleanup old images on tools-docker-builder-03}}
 
=== 2016-11-15 ===
* {{SAL entry|1=19:52 chasemp: reboot tools-precise-dev}}
* {{SAL entry|1=05:20 yuvipanda: restart all k8s webservices too}}
* {{SAL entry|1=05:05 yuvipanda: restarting all webservices on gridengine}}
* {{SAL entry|1=03:21 chasemp: reboot tools-checker-01}}
* {{SAL entry|1=02:56 chasemp: reboot tools-exec-1405 to ensure noauto works (because atboot=>false is lies)}}
* {{SAL entry|1=02:31 chasemp: reboot tools-exec-1406}}
 
=== 2016-11-14 ===
* {{SAL entry|1=22:51 chasemp: shut down bastion 02 and 05 and make 03 root only}}
* {{SAL entry|1=19:35 madhuvishy: Stopped cron on tools-cron-01 (T146154)}}
* {{SAL entry|1=18:24 madhuvishy: Tools NFS is read-only. /data/project and /home across tools are ro T146154}}
* {{SAL entry|1=16:57 yuvipanda: stopped gridengine master}}
* {{SAL entry|1=16:47 yuvipanda: start restarting kubernetes webservice pods}}
* {{SAL entry|1=16:30 madhuvishy: Unmounted all nfs shares from tools-k8s-master-01 (sudo /usr/local/sbin/nfs-mount-manager clean) T146154}}
* {{SAL entry|1=16:22 yuvipanda: kill maintain-kubeusers on tools-k8s-master-01, sole process touching NFS}}
* {{SAL entry|1=16:22 chasemp: enable puppet and run on tools-services-01}}
* {{SAL entry|1=16:21 yuvipanda: restarting all webservice jobs, watching webservicewatcher logs on tools-services-02}}
* {{SAL entry|1=16:14 madhuvishy: Disabling puppet across tools T146154}}
 
=== 2016-11-11 ===
* 20:49 madhuvishy: Dual mount of tools share complete. Puppet reenabled across tools hosts. T146154
* 20:18 madhuvishy: Rolling out dual mount of tools share across all hosts T146154
* 19:29 madhuvishy: Disabling puppet across tools to dual mount tools share from labstore-secondary T146154
 
=== 2016-11-02 ===
* 18:23 yuvipanda: manually stop tools-grid-master for reboot
* 17:42 yuvipanda: drain nodes from labvirt1012 and 13
* 13:42 chasemp: depool tools-exec-1404 for maint
 
=== 2016-11-01 ===
* 21:54 yuvipanda: stop gridengine-master on tools-grid-master in preparation for reboot
* 21:34 yuvipanda: depool tools nodes on labvirt1012
* 21:16 yuvipanda: depool things in labvirt1011
* 20:58 yuvipanda: depool tools nodes on labvirt1010
* 20:32 yuvipanda: depool tools things on labvirt1005 and 1009
* 20:08 yuvipanda: depooled things on labvirt1006 and 1008
* 19:51 yuvipanda: move tools-elastic-03 to labvirt1010, -02 already in 09
* 19:34 yuvipanda: migrate tools-elastic-03 to labvirt1009
* 19:10 yuvipanda: depooled tools nodes from labvirt1004 and 1007
* 17:57 yuvipanda: depool exec nodes on labvirt1002
* 13:27 chasemp: reboot tools-exec-1404 post depool for test
 
=== 2016-10-31 ===
* 21:50 yuvipanda: deleted cyberbot queue with qconf -dq cyberbot
* 21:44 yuvipanda: restarted cron on tools-cron-01
 
=== 2016-10-30 ===
* 02:25 yuvipanda: restarted maintain-kubeusers
 
=== 2016-10-29 ===
* 17:21 yuvipanda: depool tools-worker-1005
 
=== 2016-10-28 ===
* 20:15 chasemp: restart prometheus service on tools-prometheus-01 to see if that wakes it up
* 20:06 yuvipanda: restart kube-apiserver again, ran into too many open file handles
* 15:58 Yuvi[m]: restart k8s master, seems to have run out of fds
* 15:43 chasemp: restart toolschecker service on 01 and 02
 
=== 2016-10-27 ===
* 21:09 godog: upgrade prometheus on tools-prometheus0[12]
* 18:49 andrewbogott: rebooting  tools-webgrid-lighttpd-1401
* 13:51 chasemp: reboot tools-webgrid-generic-1403
* 13:50 chasemp: reboot dockerbuilder-01
 
=== 2016-10-26 ===
* 23:20 madhuvishy: Disabling puppet on tools proxy hosts for applying proxy health check endpoint T143638
* 23:17 godog: upgrade prometheus on tools-prometheus-02
* 16:52 bd808: Deployed jobutils_1.16_all.deb on tools-mail (default jsub target to trusty)
* 16:50 bd808: Deployed jobutils_1.16_all.deb on tools-precise-dev (default jsub target to trusty)
* 16:48 bd808: Deployed jobutils_1.16_all.deb on tools-bastion-02, tools-bastion-03, tools-cron-01 (default jsub target to trusty)
 
=== 2016-10-25 ===
* 18:48 yuvipanda: repool all depooled instances
* 04:19 yuvipanda: reboot tools-flannel-etcd-01 for https://phabricator.wikimedia.org/T149072#2741012
 
=== 2016-10-24 ===
* 03:45 Krenair: reset host keys for tools-puppetmaster-02 on -01, looks like it was recreated 5-6 days ago
 
=== 2016-10-20 ===
* 16:55 yuvipanda: killed bzip2 taking 100% CPU on tools-bastion-03
 
=== 2016-10-18 ===
* 22:56 Guest20046: flip tools-k8s-master-01 to tools-puppetmaster-02
* 07:43 yuvipanda: move all tools webgrid nodes to tools-puppetmaster-02 too
* 07:40 yuvipanda: complete moving all general tools exec nodes to tools-puppetmaster-02
* 07:33 yuvipanda: restarted puppetmaster on tools-puppetmaster-01
 
=== 2016-10-17 ===
* 14:37 chasemp: remove bdsync-deb and bdsync-deb-2 erroneously created in Tools and now defunct anyway
* 14:05 chasemp: restart puppetmaster on tools-puppetmaster-01 (instances sticking on puppet runs for a long time)
* 14:01 chasemp: reboot tools-exec-1215 and tools-exec-1410 as unresponsive
 
=== 2016-10-14 ===
* 16:20 yuvipanda: repooled tools-worker-1012, seems to have recovered?!
* 15:57 yuvipanda: drain tools-worker-1012, seems stuck
 
=== 2016-10-10 ===
* 18:04 valhallasw`vecto: sudo service bigbrother restart @ tools-services-02
 
=== 2016-10-09 ===
* 18:33 valhallasw`cloud: removed empty local crontabs for {yuvipanda, yuvipanda, tools.toolschecker} on {tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1204, tools-checker-01}. No other local crontabs remaining.
 
=== 2016-10-05 ===
* 12:15 chasemp: reboot tools-webgrid-generic-1404 as locked up
 
=== 2016-10-01 ===
* 10:03 yuvipanda: re-enable puppet on tools-checker-02
 
=== 2016-09-29 ===
* 18:15 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs via wikitech; couldn't ssh in
* 18:10 bd808: Investigating elasticsearch cluster issues affecting stashbot
 
=== 2016-09-27 ===
* 08:07 chasemp: tools-bastion-03:~# chmod 640 /var/log/syslog
 
=== 2016-09-25 ===
* 15:27 Krenair: restarted labs-logbot under tools.morebots
 
=== 2016-09-21 ===
* 18:56 madhuvishy: Repooled tools-webgrid-lighttpd-1418 (T146212) after dns records cleanup
* 18:42 madhuvishy: Repooled tools-webgrid-lighttpd-1416 (T146212) after dns records cleanup
* 16:57 chasemp: reboot tools-webgrid-lighttpd-1407, tools-webgrid-lighttpd-1210, tools-webgrid-lighttpd-1414, and then tools-webgrid-lighttpd-1405 as the first 3 return
 
=== 2016-09-20 ===
* 23:24 yuvipanda: depool tools-webgrid-lighttpd-1416 and 1418, they aren't in actual working order
* 21:23 madhuvishy|food: Pooled new sge exec node  tools-webgrid-lighttpd-1416 (T146212)
* 21:17 madhuvishy|food: Pooled new sge exec node  tools-webgrid-lighttpd-1415 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1418 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1416 (T146212)
* 20:34 madhuvishy: Created new instance tools-webgrid-lighttpd-1415 (T146212)
* 17:58 andrewbogott: reboot tools-exec-1410
* 17:54 yuvipanda: repool tools-webgrid-lighttpd-1412
* 17:49 yuvipanda: webgrid-lighttpd-1412 hung on io (no change in nova diagnostics), rebooting
* 17:33 yuvipanda: reboot tools-puppetmaster-01
* 17:20 yuvipanda: reboot tools-checker-02
* 15:42 chasemp: move floating ip from tools-checker-02 (failed) to tools-checker-01
 
=== 2016-09-13 ===
* 21:09 madhuvishy: Bumped proxy nginx worker_connections limit T143637
* 21:08 madhuvishy: Reenabled puppet across proxy hosts
* 20:44 madhuvishy: Disabling puppet across proxy hosts
 
=== 2016-09-12 ===
* 18:33 bd808: Forcing puppet run on tools-cron-01
* 18:31 bd808: Forcing puppet run on tools-bastion-03
* 18:28 bd808: Forcing puppet run on tools-bastion-02
* 18:26 bd808: Forcing puppet run on tools-precise-dev
* 18:26 bd808: Built toollabs-webservice v0.27 package and added to aptly
 
=== 2016-09-10 ===
* 01:06 yuvipanda: migrate tools-k8s-etcd-01 to labvirt1012, it is in a state where it is doing no io
 
=== 2016-09-09 ===
* 19:27 yuvipanda: reboot tools-exec-1218 and 1219
* 18:10 yuvipanda: killed massive grep running as root
 
=== 2016-09-08 ===
* 21:49 bd808: forcing puppet runs to install toollabs-webservice_0.26_all.deb
* 20:51 bd808: forcing puppet runs to install jobutils_1.15_all.deb
 
=== 2016-09-07 ===
* 21:11 Krenair: brought labs/private.git up to date on tools-puppetmaster-01
* 02:32 Krenair: ran `SULWatcher/restart_SULWatcher.sh` as `tools.stewardbots` on bastion-03 to fix T144887
 
=== 2016-09-06 ===
* 22:14 yuvipanda: got pbuilder off tools-services-01, was taking up too much space.
* 22:10 madhuvishy: Deleted instance tools-web-static-01 and tools-web-static-02 (T143637)
* 21:45 yuvipanda: reboot tools-prometheus-02. nova diagnostics shows no vda activity.
* 20:43 chasemp: drain and reboot tools-exec-1410 for testing
* 07:32 yuvipanda: depooled tools-exec-1219 and 1218, they seem to be unresponsive, causing jobs that appear to be running but aren't
 
=== 2016-09-05 ===
* 16:27 andrewbogott: rebooting tools-cron-01 because it is hanging all over the place
 
=== 2016-09-01 ===
* 05:19 yuvipanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck
 
=== 2016-08-31 ===
* 20:48 madhuvishy: Reenabled puppet across tools hosts
* 20:45 madhuvishy: Scratch migration complete on all grid exec nodes (T134896)
* 19:36 madhuvishy: Scratch migration on all non exec/worker nodes complete (T134896)
* 18:18 madhuvishy: Scratch migration complete for all k8s workers (T134896)
* 17:50 madhuvishy: Reenabling puppet across tools hosts.
* 16:55 madhuvishy: Rsync-ed over latest backup of /srv/scratch from labstore1001 to labstore1003
* 16:50 madhuvishy: Puppet disabling complete (T134896)
 
=== 2016-08-30 ===
* 18:54 valhallasw`cloud: edited /etc/shadow on a range of hosts to fix https://phabricator.wikimedia.org/T143191
* 10:59 godog: bounce stashbot, not seen on irc
 
=== 2016-08-29 ===
* 23:38 Krenair: added myself to the tools.admin service group earlier to try to figure out what was causing the outage, removed again now
* 16:35 yuvipanda: run chmod u+x /data/project/framabot
* 13:40 chasemp: restart jouncebot
 
=== 2016-08-28 ===
* 05:34 bd808: After git gc on web-static-02.tools:/srv/cdnjs: /dev/mapper/vd-cdnjs--disk  61G  54G  3.3G  95% /srv
* 05:25 bd808: sudo git gc --aggressive on tools-web-static-01.tools:/srv/cdnjs
* 04:56 bd808: sudo git gc --aggressive on tools-web-static-02.tools:/srv/cdnjs
 
=== 2016-08-26 ===
* 16:53 yuvipanda: migrate tools-static-02 to labvirt1001
 
=== 2016-08-25 ===
* 18:07 yuvipanda: restart puppetmaster on tools-puppetmaster-01
* 17:41 yuvipanda: depooled tools-webgrid-1413
* 01:16 yuvipanda: restarted puppetmaster on tools-puppetmaster-01
 
=== 2016-08-24 ===
* 23:03 chasemp: reboot tools-exec-1217
* 17:25 yuvipanda: depool tools-exec-1217, it is dead/stuck/hung/io-starved
 
=== 2016-08-23 ===
* 07:08 madhuvishy: Enabled puppet across tools after merging https://gerrit.wikimedia.org/r/#/c/305657/ (see T134896)
* 05:48 yuvipanda: restarted nginx on tools-proxy-01, was out of connection slots
 
=== 2016-08-22 ===
* 22:07 madhuvishy: Disabled puppet across tools hosts in preparation to merge https://gerrit.wikimedia.org/r/#/c/305657/ (see T134896)
* 22:01 madhuvishy: Disabling puppet across tools hosts
 
=== 2016-08-20 ===
* 11:42 valhallasw`cloud: rebooting tools-mail (hanging)
 
=== 2016-08-19 ===
* 14:52 chasemp: reboot 82323ee4-762e-4b1f-87a7-d7aa7afa22f6
 
=== 2016-08-18 ===
* 20:00 yuvipanda: restarted maintain-kubeusers on tools-k8s-master-01
 
=== 2016-08-15 ===
* 22:10 yuvipanda: depool tools-exec-1211 and 1205, seem to be out of action
* 19:12 yuvipanda: kill unused tools-merlbot-proxy
 
=== 2016-08-12 ===
* 20:39 yuvipanda: delete tools-webgrid-lighttpd-1415, enough webservices have moved to k8s from that queue
* 20:37 yuvipanda: delete tools-logs-01, going to recreate with a smaller image
* 20:36 yuvipanda: delete tools-webgrid-generic-1405, enough things have moved to k8s from that queue!
* 20:10 yuvipanda: migration of tools-grid-master to labvirt1013 complete
* 20:01 yuvipanda: migrating tools-grid-master (currently inactive) to labvirt1013 away from crowded 1010
* 12:40 chasemp: tools.templatetransclusioncheck@tools-bastion-03:~$ webservice restart
 
=== 2016-08-11 ===
* 20:13 yuvipanda: tools-grid-master finally stopped
* 20:05 yuvipanda: disabled tools-webgrid-lighttpd-1202, is hung
* 17:23 yuvipanda: instance being rebooted is tools-grid-master
* 17:22 chasemp: reboot via nova master as it is stuck
 
=== 2016-08-05 ===
* 19:29 paladox: adding tom29739 to lolrrit-wm project
 
=== 2016-08-04 ===
* 19:09 yuvipanda: cleaned up nginx log files in tools-docker-registry-01 to fix free space warning
* 00:19 yuvipanda: added Krenair as admin to help with T132225 and other issues.
 
=== 2016-08-03 ===
* 22:48 yuvipanda: deleted tools-worker-1005
* 22:08 yuvipanda: depool & delete tools-worker-1007 and 1008
* 21:34 yuvipanda: rebooting tools-puppetmaster-01 to test a hypothesis
* 21:10 yuvipanda: rebooting tools-puppetmaster-01 for kernel upgrade
* 00:20 madhuvishy: Repooled nodes tools-worker 1012 and 1013 for T141126
 
=== 2016-08-02 ===
* 22:49 yuvipanda: depooled tools-worker-1014 as well for T141126
* 22:44 yuvipanda: depool tools-worker-1015 for T141126
* 22:42 paladox: cherry picking 302617 onto lolrrit-wm
* 22:41 madhuvishy: Depooling tools-worker 1012 and 1013 for T141126
* 22:32 yuvipanda: added paladox to tools
* 09:38 godog: bounce morebots production
* 00:01 yuvipanda: depool tools-worker-1017 for T141126
 
=== 2016-08-01 ===
* 23:48 madhuvishy: Repooled tools-worker-1011 and tools-worker-1018 (Yuvi) for T114126
* 23:41 madhuvishy: Repooled tools-worker-1010 and tools-worker-1019 (Yuvi) for T114126
* 23:21 madhuvishy: Yuvi is depooling tools-worker-1018 for T114126
* 23:19 madhuvishy: Depooling tools-worker 1010 and 1011 for T114126
* 23:17 madhuvishy: Yuvi depooled tools-worker-1019 for T114126
* 23:06 madhuvishy: Added tools-worker-1022 as new k8s worker node
* 23:06 madhuvishy: Repooled tools-worker-1009 (T114126)
* 22:48 madhuvishy: Depooling tools-worker-1009 to prepare for T141126
 
=== 2016-07-29 ===
* 22:04 YuviPanda: repooled tools-worker-1006
* 21:48 YuviPanda: deleted tools-worker-1006 after depooling+draining
* 21:45 YuviPanda: repool new tools-worker-1003 with direct-lvm docker storage backend
* 21:30 YuviPanda: depool tools-worker-1003 to be recreated with new docker config, picking this because it's on a non-ssd host
* 21:17 YuviPanda: depooled tools-worker-1020/21 after fixing them up
* 20:41 YuviPanda: delete tools-worker-1001
* 20:29 YuviPanda: depool tools-worker-1001, going to recreate it to test new puppet deploying-first-run
* 20:26 YuviPanda: built new worker nodes tools-worker-1020 and 21 with direct-lvm storage backend
* 17:48 YuviPanda: disable puppet on all tools k8s worker nodes
 
=== 2016-07-25 ===
* 14:17 chasemp: nova reboot 64f01f90-c805-4a2e-9ed5-f523b909094e (grid master)
 
=== 2016-07-23 ===
* 23:21 YuviPanda: restart maintain-kubeusers on tools-k8s-master-01, was stuck on connecting to seaborgium preventing new tool creation
* 01:56 YuviPanda: deploy kubernetes v1.3.3wmf1
 
=== 2016-07-22 ===
* 17:30 YuviPanda: repool tools-worker-1018
* 14:04 chasemp: reboot tools-worker-1015 as stuck w/ high iowait warning seconds ago.  I cannot ssh in as root.
 
=== 2016-07-21 ===
* 22:42 chasemp: reboot tools-worker-1018 as stuck T141017
 
=== 2016-07-20 ===
* 21:27 andrewbogott: rebooting tools-k8s-etcd-01
* 11:14 Guest9334: rebooted tools-worker-1004
 
=== 2016-07-19 ===
* 01:06 bd808: Upgraded Elasticsearch on tools-elastic-* to 2.3.4
 
=== 2016-07-18 ===
* 21:50 YuviPanda: force downgrade hhvm on tools-webgrid-lighttpd-1408 to fix puppet issues
* 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm on tools-worker-1004
* 21:40 YuviPanda: bind mount and kill files in /var/lib/docker that were mounted over by proper mount on lvm
* 21:37 YuviPanda: killed tools-pastion-01, no longer in use
* 20:59 bd808: Disabled puppet on tools-elastic-0[123]. Elasticsearch needs to be upgraded.
* 15:15 YuviPanda: kill 8807036 for Luke081515
* 12:48 YuviPanda: reboot tools-flannel-etcd-03 for T140256
* 12:41 YuviPanda: reboot tools-k8s-etcd-02 for T140256
 
=== 2016-07-15 ===
* 10:24 yuvipanda: depool tools-exec-1402 for T138447
* 10:24 yuvipanda: reboot tools-exec-1402 for T138447
* 10:16 yuvipanda: depooling tools-webgrid-lighttpd-1402 and -1412 since they seem to be suffering from T138447
* 10:08 yuvipanda: reboot tools-webgrid-lighttpd-1402 and 1412
 
=== 2016-07-14 ===
* 23:12 bd808: Added Madhuvishy to project "roots" sudoer list
* 22:58 bd808: Added Madhuvishy as projectadmin
* 21:25 chasemp: change perms for tools.readmore to correct bot
 
=== 2016-07-13 ===
* 11:40 yuvipanda: cold-migrate tools-worker-1014 off labvirt1010 to see if that improves the ksoftirqd situation
* 11:19 yuvipanda: drained tools-worker-1004 - high ksoftirqd usage even with no load
* 11:13 yuvipanda: depool tools-worker-1014 - unusable, totally in iowait
* 11:13 yuvipanda: reboot tools-worker-1004, was unresponsive
 
=== 2016-07-12 ===
* 18:07 yuvipanda: reboot tools-worker-1012, it seems to have failed LDAP connectivity :|
 
=== 2016-07-08 ===
* 12:38 yuvipanda: starting up tools-web-static-02 again
 
=== 2016-07-07 ===
* 12:45 yuvipanda: start deployment of k8s 1.3.0wmf4 for T139259
 
=== 2016-07-06 ===
* 13:09 yuvipanda: associated a floating IP with tools-k8s-master-01 for T139461
* 11:47 yuvipanda: moved tools-checker-0[12] to use tools-puppetmaster-01 as puppetmaster so they get appropriate CA for use when talking to kubernetes API
 
=== 2016-07-04 ===
* 11:13 yuvipanda: delete tools-prometheus-01 to free up resources on labvirt1010
* 11:11 yuvipanda: actually deleted instance tools-cron-02 to free up resources on labvirt1010 - was large and not currently used, and failover process takes a while anyway, so we can recreate if needed
* 11:11 yuvipanda: stopped instance tools-cron-02 to free up some resources on labvirt1010
 
=== 2016-07-03 ===
* 17:09 yuvipanda: run qstat -u '*' | grep 'dr ' | awk '{ print $1;}' | xargs -L1 qdel -f to clean out jobs stuck in dr state
* 16:59 yuvipanda: migrate tools-web-static-02 to labvirt1011 to provide more breathing room
* 16:56 yuvipanda: delete temp-test-trusty-package to provide more breathing room on labvirt1010
* 13:49 yuvipanda: reboot tools-exec-1219
* 13:37 yuvipanda: migrating tools-exec-1216 to labvirt1011
* 13:07 yuvipanda: delete tools-bastion-01 which was shut down anyway
* 13:04 yuvipanda: attempt to reboot tools-exec-1212
 
=== 2016-06-28 ===
* 15:25 bd808: Signed client cert for tools-worker-1019.tools.eqiad.wmflabs on tools-puppetmaster-01.tools.eqiad.wmflabs
 
=== 2016-06-21 ===
* 16:49 bd808: Updated jobutils to v1.14 for T138178
 
=== 2016-06-17 ===
* 06:17 yuvipanda: forced deletion of 7033590 for dykbot for shubinator
 
=== 2016-06-08 ===
* 20:31 yuvipanda: start tools-bastion-03, which was stuck in 'stopped' state
* 20:31 yuvipanda: reboot tools-bastion-03
 
=== 2016-05-31 ===
* 17:35 valhallasw`cloud: re-enabled queues on  tools-exec-1407, tools-exec-1216, tools-exec-1219
* 13:13 chasemp: reboot of tools-exec-1203, see T136495; all jobs seem gone now
 
=== 2016-05-30 ===
* 13:06 valhallasw`cloud: rebooting tools-exec-1221
* 11:53 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/280652 https://gerrit.wikimedia.org/r/#/c/290479 https://gerrit.wikimedia.org/r/#/c/291710/ on tools-puppetmaster-01
 
=== 2016-05-29 ===
* 18:58 YuviPanda: deleted tools-k8s-bastion-01 for T136496
* 14:29 valhallasw`cloud: chowned /data/project/xtools-mab-dev to root and back to stop rogue process that was writing to the directory. I'm still not sure where that process  was running, but at least this seems to have solved the issue
 
=== 2016-05-28 ===
* 21:52 valhallasw`cloud: rebooted tools-webgrid-lighttpd-1408, tools-pastion-01, tools-exec-1205
* 21:21 valhallasw`cloud: rebooting tools-exec-1204 (T136495)
 
=== 2016-05-27 ===
* 14:45 YuviPanda: start moving tools-bastion-03 to use tools-puppetmaster-01 as puppetmaster
 
=== 2016-05-25 ===
* 20:15 YuviPanda: deleted tools-bastion-mtemp per chasemp
* 19:43 YuviPanda: delete devpi instance, not currently in use
* 19:39 YuviPanda: run  sudo dpkg --configure -a on tools-worker-1007 to get it unstuck
* 19:19 YuviPanda: deleted tools-docker-builder-01 and -02, hosed hosts that are unused
* 17:18 YuviPanda: fixed hhvm upgrade on tools-cron-01
* 07:19 YuviPanda: hard reboot tools-services-01, was completely stuck on /public/dumps
* 06:06 bd808: Restarting all webservice jobs
* 05:33 andrewbogott: rebooting tools-proxy-02
 
=== 2016-05-24 ===
* 01:36 scfc_de: tools-cron-02: Downgraded hhvm (sudo apt-get install hhvm).
* 01:36 scfc_de: tools-bastion-03, tools-checker-01, tools-cron-02, tools-exec-1202, tools-proxy-02, tools-redis-1001: Remounted /public/dumps read-only (while sudo umount /public/dumps; do :; done && sudo puppet agent -t).
 
=== 2016-05-23 ===
* 19:36 YuviPanda: switched tools-checker to tools-checker-03
* 16:33 bd808: Rebooting tools-elastic-02.tools.eqiad.wmflabs
* 13:28 chasemp: 'apt-get install hhvm -y --force-yes' across trusty hosts to handle hhvm downgrade
 
=== 2016-05-20 ===
* 23:39 bd808: Forced puppet run on bastion-02 & bastion-05 to apply fix for T135861
* 19:47 chasemp: tools-exec-1406 having issues rebooting
 
=== 2016-05-19 ===
* 21:07 bd808: deployed jobutils 1.13 on bastions; now with '-l release=...' validation!
* 15:43 YuviPanda: rebooting all tools worker instances
* 13:12 chasemp: reboot tools-exec-1220, stuck in a state of unresponsiveness
 
=== 2016-05-13 ===
* 00:40 YuviPanda: cleared all queues that were in error state
 
=== 2016-05-12 ===
* 22:59 YuviPanda: restart tools-worker-1004 to attempt bringing it back up
* 22:59 YuviPanda: deploy k8s 1.2.4wmf1 on all proxy nodes
* 22:58 YuviPanda: deploy k8s on all worker nodes
* 22:46 YuviPanda: deploy k8s master for 1.2.4wmf1
 
=== 2016-05-10 ===
* 04:25 bd808: Added role::package::builder to tools-services-01
 
=== 2016-05-09 ===
* 04:33 YuviPanda: reboot tools-worker-1004, lots of ksoftirqd stuckness despite no actual containers running
 
=== 2016-05-08 ===
* 07:06 YuviPanda: restarted admin tool
 
=== 2016-05-05 ===
* 13:11 godog: cherry-pick https://gerrit.wikimedia.org/r/#/c/280652/ on puppetmaster
 
=== 2016-04-28 ===
* 04:15 YuviPanda: delete half of the trusty webservice jobs
* 04:00 YuviPanda: deleted all precise webservice jobs, waiting for webservicemonitor to bring them back up
 
=== 2016-04-24 ===
* 12:22 YuviPanda: force deleted job 5435259 from pbbot per PeterBowman
 
=== 2016-04-11 ===
* 14:20 andrewbogott: moving tools-bastion-mtemp to labvirt1009
 
=== 2016-04-06 ===
* 15:20 bd808: Removed local hack for T131906 from tools-puppetmaster-01
 
=== 2016-04-05 ===
* 21:24 bd808: Committed local hack on tools-puppetmaster-01 to get elasticsearch working again
* 21:02 bd808: Forcing puppet runs to fix elasticsearch
* 20:39 bd808: Elasticsearch processes down. Looks like a prod puppet change that needs tweaking for tool labs
 
=== 2016-04-04 ===
* 19:43 YuviPanda: new bastion!
* 19:15 chasemp: reboot tools-bastion-05
 
=== 2016-03-30 ===
* 15:50 andrewbogott: rebooting tools-proxy-01 in hopes of clearing some bad caches
 
=== 2016-03-28 ===
* 20:51 yuvipanda: lifted RAM quota from 900Gigs to 1TB?!
* 20:30 chasemp: change perms on grant files from create-dbusers: chmod 400 and chattr +i
 
=== 2016-03-27 ===
* 17:40 scfc_de: tools-webgrid-generic-1405, tools-webgrid-lighttpd-1411, tools-web-static-01, tools-web-static-02: "apt-get install cloud-init" and accepted changes for /etc/cloud/cloud.cfg (users: + default; cloud_config_modules: + ssh-import-id, + puppet, + chef, + salt-minion; system_info/package_mirrors/arches[i386, amd64]/search/primary: + http://%(region)s.clouds.archive.ubuntu.com/ubuntu/).
 
=== 2016-03-18 ===
* 15:47 chasemp: had to kill stalkboten as it was logging constant errors filling logs to the tune of hundreds of gigs
* 15:36 chasemp: cleanup huge log collection for broken bot: /srv/project/tools/project/betacommand-dev/tspywiki/irc/logs# rm -fR SpamBotLog.log\.*
 
=== 2016-03-11 ===
* 20:57 mutante: reverted font changes - puppet runs recovering
* 20:37 mutante: more puppet issues due to font dependencies on trusty, on it
* 19:39 mutante: should a tools-exec server be influenced by font packages on an mw appserver?
* 19:39 mutante: fixed puppet runs on tools-exec (gerrit 276792)
 
=== 2016-03-02 ===
* 14:56 chasemp: qdel 3956069 and 3758653 for abusing auth
 
=== 2016-02-29 ===
* 21:49 scfc_de: tools-exec-1218: rm -f /usr/local/lib/nagios/plugins/check_eth to work around "Got passed new contents for sum" (https://tickets.puppetlabs.com/browse/PUP-1334).
* 21:20 scfc_de: tools-exec-1209: rm -f /var/lib/puppet/state/agent_catalog_run.lock (no Puppet process running, probably from the reboots).
* 20:58 scfc_de: Ran "dpkg --configure -a" on all instances.
* 13:50 scfc_de: Deployed jobutils/misctools 1.10.
 
=== 2016-02-28 ===
* 20:08 bd808: Removed unwanted NFS mounts from tools-elastic-01.tools.eqiad.wmflabs
 
=== 2016-02-26 ===
* 19:08 bd808: Upgraded Elasticsearch on tools-elastic-0[123] to 1.7.5
 
=== 2016-02-25 ===
* 21:43 scfc_de: Deployed jobutils/misctools 1.9.
 
=== 2016-02-24 ===
* 19:46 chasemp: runonce deployed for https://gerrit.wikimedia.org/r/#/c/272891/
 
=== 2016-02-22 ===
* 15:55 andrewbogott: redirecting tools-login.wmflabs.org to tools-bastion-05
 
=== 2016-02-19 ===
* 15:58 chasemp: rerollout tools nfs shaping pilot for sanity in anticipation of formalization
* 09:21 _joe_: killed cluebot3 instance on tools-exec-1207, writing 20 M/s to the error log
* 00:50 yuvipanda: failover services to services-02
 
=== 2016-02-18 ===
* 20:37 yuvipanda: failover proxy back to tools-proxy-01
* 19:46 chasemp: repool labvirt1003 and depool labvirt1004
* 18:19 chasemp: draining nodes from labvirt1001
 
=== 2016-02-16 ===
* 21:33 chasemp: reboot of bastion-1002
 
=== 2016-02-12 ===
* 19:56 chasemp: nfs traffic shaping pilot round 2
 
=== 2016-02-05 ===
* 22:01 chasemp: throttle some vm nfs write speeds
* 16:49 scfc_de: find /data/project/wikidata-edits -group ssh-key-ldap-lookup -exec chgrp tools.wikidata-edits \{\} + (probably a remnant of the work on ssh-key-ldap-lookup last summer).
* 16:45 scfc_de: Removed /data/project/test300 (uid/gid 52080; none of them resolves, no databases, just an unmodified pywikipedia clone inside).
 
=== 2016-02-03 ===
* 03:00 YuviPanda: upgraded flannel on all hosts running it
 
=== 2016-01-31 ===
* 20:01 scfc_de: tools-webgrid-generic-1405: Rebooted via wikitech; rebooting via "shutdown -r now" did not seem to work.
* 18:51 bd808: tools-elastic-01.tools.eqiad.wmflabs console shows blocked tasks, possible kernel bug?
* 18:49 bd808: tools-elastic-01.tools.eqiad.wmflabs not responsive to ssh or Elasticsearch requests; rebooting via wikitech interface
* 13:32 hashar: restarted qamorebot
 
=== 2016-01-30 ===
* 06:38 scfc_de: tools-webgrid-generic-1405: Rebooted for load ~ 175 and lots of processes stuck in D.
 
=== 2016-01-29 ===
* 21:25 YuviPanda: restarted image-resize-calc manually, no service.manifest file
 
=== 2016-01-28 ===
* 15:02 scfc_de: tools-cron-01: Rebooted via wikitech as "shutdown -r now" => "@sbin/plymouthd --mode=shutdown" => "/bin/sh -e /proc/self/fd/9" => "/bin/sh /etc/init.d/rc 6" => "/bin/sh /etc/rc6.d/S20sendsigs stop" => "sync" stuck in D.  *argl*
* 14:56 scfc_de: tools-cron-01: Rebooted due to high number of processes stuck in D and load >> 100.
* 14:54 scfc_de: tools-cron-01: HUPped 43 processes wikitrends/refresh.sh, though a lot of processes seem to be stuck in D, so I'll reboot this instance.
* 14:50 scfc_de: tools-cron-01: HUPped 85 processes /usr/lib/php5/sessionclean.
 
=== 2016-01-27 ===
* 23:07 YuviPanda: removed all members of templatetiger, added self instead, removed active shell sessions
* 20:24 chasemp: master stop, truncate accounting log to accounting.01272016, master start
* 19:34 chasemp: master start grid master
* 19:23 chasemp: stopped master
* 19:11 YuviPanda: depooled tools-webgrid-1405 to prep for restart, lots of stuck processes
* 18:29 valhallasw`cloud: job 2551539 is ifttt, which is also running as 2700629. Killing 2551539 .
* 18:26 valhallasw`cloud: messages repeatedly reports "01/27/2016 18:26:17|worker|tools-grid-master|E|execd@tools-webgrid-generic-1405.tools.eqiad.wmflabs reports running job (2551539.1/master) in queue "webgrid-generic@tools-webgrid-generic-1405.tools.eqiad.wmflabs" that was not supposed to be there - killing". SSH'ing there to investigate
* 18:24 valhallasw`cloud: 'sleep' test job also seems to work without issues
* 18:23 valhallasw`cloud: no errors in log file, qstat works
* 18:23 chasemp: master sge restarted post dump and restart for jobs db
* 18:22 valhallasw`cloud: messages file reports 'Wed Jan 27 18:21:39 UTC 2016 db_load_sge_maint_pre_jobs_dump_01272016'
* 18:20 chasemp: master db_load -f /root/sge_maint_pre_jobs_dump_01272016 sge_job
* 18:19 valhallasw`cloud: dumped jobs database to /root/sge_maint_pre_jobs_dump_01272016, 4.6M
* 18:17 valhallasw`cloud: SGE Configuration successfully saved to /root/sge_maint_01272016 directory.
* 18:14 chasemp: grid master stopped
* 00:56 scfc_de: Deployed admin/www bde15df..12a3586.
 
=== 2016-01-26 ===
* 21:28 YuviPanda:  qstat -u '*' | grep E | awk '{print $1}' | xargs -L1 qmod -cj
* 21:16 chasemp: reboot tools-exec-1217.tools.eqiad.wmflabs
 
=== 2016-01-25 ===
* 20:30 YuviPanda: switched over cron host to tools-cron-01, manually copied all old cron files from tools-submit to tools-cron-01
* 19:06 chasemp: kill python merge/merge-unique.py tools-exec-1213 as it seemed to be overwhelming nfs
* 17:07 scfc_de: Deployed admin/www at bde15df2a379c33edfb8350afd2f0c7186705a93.
 
=== 2016-01-23 ===
* 15:49 scfc_de: Removed remnant send_puppet_failure_emails cron entries except from unreachable hosts sacrificial-kitten, tools-worker-06 and tools-worker-1003.
 
=== 2016-01-21 ===
* 22:24 YuviPanda: deleted tools-redis-01 and -02 (are on 1001 and 1002 now)
* 21:13 YuviPanda: repooled exec nodes on labvirt1010
* 21:08 YuviPanda: gridengine-master started, verified shadow hasn't started
* 21:00 YuviPanda: stop gridengine master
* 20:51 YuviPanda: repooled exec nodes on labvirt1007 was last message
* 20:51 YuviPanda: repooled exec nodes on labvirt1006
* 20:39 YuviPanda: failover tools-static to tools-web-static-01
* 20:38 YuviPanda: failover tools-checker to tools-checker-01
* 20:32 YuviPanda: depooled exec nodes on 1007
* 20:32 YuviPanda: repooled exec nodes on 1006
* 20:14 YuviPanda: depooled all exec nodes in labvirt1006
* 20:11 YuviPanda: repooled exec nodes on 1005
* 19:53 YuviPanda: depooled exec nodes on labvirt1005
* 19:49 YuviPanda: repooled exec nodes from labvirt1004
* 19:48 YuviPanda: failed over proxy to tools-proxy-01 again
* 19:31 YuviPanda: depooled exec nodes from labvirt1004
* 19:29 YuviPanda: repooled exec nodes from labvirt1003
* 19:13 YuviPanda: depooled instances on labvirt1003
* 19:06 YuviPanda: re-enabled queues on exec nodes that were on labvirt1002
* 19:02 YuviPanda: failed over tools proxy to tools-proxy-02
* 18:46 YuviPanda: drained and disabled queues on all nodes on labvirt1002
* 18:38 YuviPanda: restarted all restartable jobs in instances on labvirt1001 and deleted all non-restartable ghost jobs. these were already dead
 
=== 2016-01-12 ===
* 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).
 
=== 2016-01-11 ===
* 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30
* 22:12 YuviPanda: restarted gridengine master again
* 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
* 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
* 21:57 valhallasw`cloud: reset to 7:30
* 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
* 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
* 21:45 YuviPanda: restarted gridengine master
* 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
* 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
* 21:41 valhallasw`cloud: currently 353 jobs in qw state
* 21:40 valhallasw`cloud: that's load_adjustment_decay_time
* 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
* 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
* 17:51 YuviPanda: kill all queries running on labsdb1003
* 17:20 YuviPanda: stopped webservice for quentinv57-tools
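
The scheduler parameters toggled in the entries above (maxujobs, job_load_adjustments, load_adjustment_decay_time) live in the gridengine scheduler configuration. A minimal sketch of how they can be inspected and edited, with the values mentioned in these entries shown as illustrative output:
<syntaxhighlight lang="bash">
# Sketch only: view the three scheduler knobs mentioned above, then edit interactively.
qconf -ssconf | grep -E 'maxujobs|job_load_adjustments|load_adjustment_decay_time'
# maxujobs                     128                (0 = unlimited concurrent jobs per user)
# job_load_adjustments         np_load_avg=0.50
# load_adjustment_decay_time   0:7:30
qconf -msconf    # opens the scheduler configuration in $EDITOR for changes
</syntaxhighlight>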
 
=== 2016-01-09 ===
* 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229  back to tools-checker-01
* 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
* 13:12 valhallasw`cloud: tools-worker-1002 is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.
 
=== 2016-01-08 ===
* 19:46 chasemp: couldn't get into tools-mail-01 at all and it seemed borked so I rebooted
* 17:23 andrewbogott: killing tools.icelab as per https://wikitech.wikimedia.org/wiki/User_talk:Torin#Running_queries_on_tools-dev_.28tools-bastion-02.29
 
=== 2015-12-30 ===
* 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
* 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
* 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
* 00:40 YuviPanda: restarted master on grid-master
* 00:40 YuviPanda: copied and cleaned out spooldb
* 00:10 YuviPanda: reboot tools-grid-shadow
* 00:08 YuviPanda: attempt to stop shadowd
* 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
* 00:00 YuviPanda: kill -9'd gridengine master
 
=== 2015-12-29 ===
* 23:31 YuviPanda: rebooting tools-grid-master
* 23:22 YuviPanda: restart gridengine-master on tools-grid-master
* 00:18 YuviPanda: shut down redis on tools-redis-01
 
=== 2015-12-28 ===
* 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
* 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
* 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
* 21:27 YuviPanda: created tools-redis-1001
 
=== 2015-12-23 ===
* 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
* 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
* 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
* 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
* 18:40 valhallasw`cloud: scratch that, first going to eat dinner
* 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
* 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
* 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel
 
=== 2015-12-22 ===
* 18:30 YuviPanda: rescheduling all webservices
* 18:17 YuviPanda: failed over active proxy to proxy-01
* 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
* 01:42 YuviPanda: rebooting tools-worker-08
 
=== 2015-12-21 ===
* 18:44 YuviPanda: reboot tools-proxy-01
* 18:31 YuviPanda: failover proxy to tools-proxy-02
 
=== 2015-12-20 ===
* 00:00 YuviPanda: tools-worker-08 stuck again :|
 
=== 2015-12-18 ===
* 15:16 andrewbogott: rebooting locked up host tools-exec-1409
 
=== 2015-12-16 ===
* 23:14 andrewbogott: rebooting  tools-exec-1407, unresponsive
* 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
* 21:28 andrewbogott: deleted tools-docker-registry-01
* 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup
 
=== 2015-12-12 ===
* 10:08 YuviPanda: restarted cron on tools-submit
 
=== 2015-12-10 ===
* 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.
 
=== 2015-12-07 ===
* 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
* 10:46 YuviPanda: restarted nscd on tools-proxy-01
 
=== 2015-12-06 ===
* 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest
 
=== 2015-12-04 ===
* 19:33 Coren: switching master role to tools-grid-master
* 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
* 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01
 
=== 2015-12-02 ===
* 18:29 Coren: switching gridmaster activity to tools-grid-shadow
* 05:13 yuvipanda: increased security groups quota to 50 because why not
 
=== 2015-12-01 ===
* 21:07 yuvipanda: added bd808 as admin
* 21:01 andrewbogott: deleted tool/service group tools.test300
 
=== 2015-11-25 ===
* 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002
 
=== 2015-11-20 ===
* 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
* 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
* 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
* 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
* 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
* 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
* 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
* 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
* 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
* 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
* 19:25 Coren: -lighttpd-1403 wants a restart.
* 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
* 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
* 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
* 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services
 
=== 2015-11-17 ===
* 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
* 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens
 
=== 2015-11-16 ===
* 20:44 PlasmaFury: switch over the proxy to tools-proxy-01
* 17:38 PlasmaFury: deleted tools-webgrid-lighttpd-1412 for https://phabricator.wikimedia.org/T118654
 
=== 2015-11-03 ===
* 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).
 
=== 2015-11-02 ===
* 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
* 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
* 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
* 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
* 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs
 
=== 2015-10-26 ===
* 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts
 
=== 2015-10-11 ===
* 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code
 
=== 2015-10-09 ===
* 22:47 yuvipanda: kill NFS on tools-puppetmaster-01 with https://wikitech.wikimedia.org/wiki/Hiera:Tools/host/tools-puppetmaster-01
* 14:37 Coren: Beginning rotation of execution nodes to apply fix for T106170
 
=== 2015-10-06 ===
* 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare
 
=== 2015-10-02 ===
* 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).
 
=== 2015-10-01 ===
* 23:38 yuvipanda: actually rebooting tools-worker-02, had actually rebooted -01 earlier #facepalm
* 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
* 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
* 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel
 
=== 2015-09-30 ===
* 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
* 06:40 yuvipanda: migrated webproxy to tools-proxy-01
 
=== 2015-09-29 ===
* 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
 
=== 2015-09-28 ===
* 15:24 Coren: rebooting tools-shadow after mount option changes.
 
=== 2015-09-25 ===
* 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
 
=== 2015-09-24 ===
* 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
* 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.
 
=== 2015-09-23 ===
* 18:22 valhallasw`cloud: here = https://etherpad.wikimedia.org/p/74j8K2zIob
* 18:22 valhallasw`cloud: experimenting with https://github.com/jordansissel/fpm on tools-packages, and manually installing packages for that. Noting them here.
 
=== 2015-09-16 ===
* 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
* 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
* 01:17 YuviPanda: attempting to move to kubernetes
 
=== 2015-09-15 ===
* 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.
 
=== 2015-09-14 ===
* 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.
 
=== 2015-09-13 ===
* 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
* 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).
 
=== 2015-09-11 ===
* 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).
 
=== 2015-09-08 ===
* 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.<br>Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
* 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
* 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
 
=== 2015-09-07 ===
* 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project    on tools-static-01, which also solved the issue, so skipping the reboot
* 18:47 valhallasw`cloud: switched static webserver to tools-static-02
* 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
* 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master
 
=== 2015-09-03 ===
* 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
* 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
* 07:07 valhallasw`cloud: err, is empty.
* 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!
 
=== 2015-09-02 ===
* 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
* 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
* 13:55 valhallasw`cloud: restarted gridengine_exec on  tools-exec-1403
* 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.
* 13:16 YuviPanda: deleted all jobs of ralgisbot
* 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
* 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles
 
=== 2015-09-01 ===
* 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
* 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
* 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
* 06:23 valhallasw`cloud: seems to have worked. SGE :(
* 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
* 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
* 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
* 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email
 
=== 2015-08-31 ===
* 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
* 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
* 21:20 valhallasw`cloud: restarted webservicemonitor
* 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
* 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
* 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
* 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
* 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
* 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
* 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
* 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
* 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
* 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
* 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
* 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
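
The mass restart described in the entries above was done by rescheduling each webgrid job with a pause between submissions, after a one-per-second rate proved too much for the master. A minimal sketch of that throttled loop, assuming a file of job ids (one per line) like the /home/valhallaw/webgrid_jobs list mentioned above:
<syntaxhighlight lang="bash">
# Sketch only: throttled reschedule of webgrid jobs, sorted to spread load between queues.
sort webgrid_jobs | while read -r job; do
    qmod -rj "$job"       # ask gridengine to reschedule the job onto another node
    sleep 5               # pacing used above after 1 s per job proved too aggressive
done
# in another terminal, watch the accounting log for jobs that die instead of restarting
tail -f /var/lib/gridengine/default/common/accounting
</syntaxhighlight>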
 
=== 2015-08-30 ===
* 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
* 13:20 valhallasw`cloud: disabling 503 error page
 
=== 2015-08-29 ===
* 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".
 
=== 2015-08-27 ===
* 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again
 
=== 2015-08-26 ===
* 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start.  If it goes berserk, please service bigbrothermonitor stop.
 
=== 2015-08-25 ===
* 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
* 14:58 YuviPanda: pooled in two new instances for the precise exec pool
* 14:45 YuviPanda: reboot tools-exec-1221
* 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
* 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
* 10:16 YuviPanda: created tools-webgrid-generic-1405
* 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
* 09:59 YuviPanda: created tools-exec-1220 and -1221
 
=== 2015-08-24 ===
* 16:37 valhallasw`cloud: more processes were started, so added a talk page message on [[User:Coet]] (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
* 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
* 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01
 
=== 2015-08-20 ===
* 18:44 valhallasw`cloud: both are now at 3dbbc87
* 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
* 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
* 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
* 17:06 valhallasw`cloud: wait, what timezone is this?!
 
=== 2015-08-19 ===
* 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
 
=== 2015-08-18 ===
* 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
* 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
* 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
* 13:55 valhallasw`cloud: no, wait, that's ''tools-webgrid-lighttpd-1411.eqiad.wmflabs'', not the actual host ''tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs''. We should fix that dns mess as well.
* 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
* 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
* 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using <code>for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done</code>
* 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs, tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
* 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
* 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
* 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
* 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
* 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
* 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
* 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
* 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
* 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
* 08:00 valhallasw`cloud: running puppet agent -tv again
* 07:55 valhallasw`cloud: argh. Disabling  toollabs::node::web::generic again and enabling  toollabs::node::web::lighttpd
* 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
* 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory  --- ran sudo touch /usr/lib/adminbot/README
* 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
* 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
* 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
 
=== 2015-08-17 ===
* 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
* 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
* 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
* 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
 
=== 2015-08-15 ===
* 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
* 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...
 
=== 2015-08-14 ===
* 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215  tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
* 15:20 andrewbogott: Adding back to the grid engine queue:  tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
 
=== 2015-08-13 ===
* 18:51 valhallasw`cloud: which was resolved by scfc earlier
* 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid; Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
* 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
* 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
* 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
* 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on  tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204  tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405
 
=== 2015-08-12 ===
* 16:05 andrewbogott: depooling  tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204  tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
* 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
* 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410
 
=== 2015-08-11 ===
* 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow
 
=== 2015-08-04 ===
* 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).
 
=== 2015-08-03 ===
* 19:13 andrewbogott: deleted tools-static-01
 
=== 2015-08-01 ===
* 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
* 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).
 
=== 2015-07-30 ===
* 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
* 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
* 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
* 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
* 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
* 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
* 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
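A hedged reconstruction of the per-node procedure described in the T107052 entries above (hostname illustrative):
<pre>
# Keep new webservices off the node, then terminate the running lighttpds so
# tools-manifest restarts the affected tools on other nodes:
sudo qmod -d 'webgrid-lighttpd@tools-webgrid-lighttpd-1406.eqiad.wmflabs'
sudo killall -TERM lighttpd
sudo reboot
# Once the node is back up:
sudo qmod -e 'webgrid-lighttpd@tools-webgrid-lighttpd-1406.eqiad.wmflabs'
</pre>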
 
=== 2015-07-29 ===
* 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
* 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
* 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}
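The rmdir one-liner above relies on bash emitting the deepest path first during brace expansion, so each directory is already empty when rmdir reaches it; to preview the expansion (note the escaped trailing space in "authorized_keys\ ", which is part of the stray directory names):
<pre>
printf '%s\n' /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}
</pre>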
 
=== 2015-07-28 ===
* 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
* 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
* 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
* 02:07 YuviPanda: removed pacct files from tools-bastion-01
 
=== 2015-07-27 ===
* 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of [[phab:T107052]]: <pre>accton off</pre>
 
=== 2015-07-19 ===
* 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).
 
=== 2015-07-11 ===
* 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt
 
=== 2015-07-10 ===
* 23:59 mutante: fixing puppet runs on tools-exec via salt
* 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!
 
=== July 6 ===
* 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)
 
=== July 2 ===
* 17:07 valhallasw`cloud: can't login to tools-mailrelay-01, probably because puppet was disabled for too long. Deleting instance.
* 16:12 valhallasw`cloud: I mean tools-bastion-01
* 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/
 
=== June 29 ===
* 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02
 
=== June 21 ===
* 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
* 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
* 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).
 
=== June 19 ===
* 15:07 YuviPanda: remounting /data/scratch
 
=== June 10 ===
* 11:52 YuviPanda: tools-trusty be gone
 
=== June 8 ===
* 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access
 
=== June 7 ===
* 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS
 
=== June 5 ===
* 17:44 YuviPanda: migrate tools-shadow to labvirt1002
 
=== June 2 ===
* 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
* 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
* 16:20 Coren: switching back to tools-master
* 16:10 YuviPanda: restart nscd on tools-submit
* 15:54 Coren: Switching names for tools-exec-1401
* 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
* 14:34 YuviPanda: turned off dnsmasq for toollabs
* 13:54 Coren: adding new-style names for submit hosts
* 13:53 YuviPanda: moved tools-master / shadow to designate
* 13:52 Coren: new-style names for gridengin admin hosts added
* 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
* 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
* 13:17 Coren: killing the sge_qmaster to test failover
* 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd
 
=== May 29 ===
* 13:39 YuviPanda: tools-redis-01 is redis master now
* 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
* 13:01 YuviPanda: recreating tools-redis-01 and -02
* 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
* 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff
 
=== May 28 ===
* 12:22 wm-bot: petrb: inserted some local IP's to hosts file
* 12:15 wm-bot: petrb: shutting nscd off on tools-master
* 12:14 wm-bot: petrb: test
* 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
* 11:25 petan: rebooted tools-master in order to try fix that network issues
 
=== May 27 ===
* 20:10 LostPanda: disabled puppet on tools-shadow too
* 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster  haaail someone?
* 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
* 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS
 
=== May 23 ===
* 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).
 
=== May 22 ===
* 20:37 yuvipanda: deleted and depooled tools-exec-07
 
=== May 20 ===
* 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
* 20:01 yuvipanda: enabling puppet on all hosts
* 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
* 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
* 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
* 19:54 yuvipanda: enabled puppet on tools-precise-dev
* 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
* 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state
 
=== May 19 ===
* 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
* 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
* 20:12 yuvipanda: force killed croptool webservice
 
=== May 18 ===
* 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
* 01:32 yuvipanda: killed tools-checker-01 instance, recreating
 
=== May 15 ===
* 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
* 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
* 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis
 
=== May 14 ===
* 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
* 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
* 03:29 yuvipanda: drained, depooled and deleted tools-exec-15
 
=== May 10 ===
* 22:08 yuvipanda: created tools-precise-dev instance
* 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
* 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.
 
=== May 5 ===
* 18:50 Betacommand: helperbot WP:AVI bot running logged out owner is MIA, Coren killed job from 1204 and commented out crontab
 
=== May 4 ===
* 21:24 yuvipanda: reboot tools-submit, was stuck
 
=== May 2 ===
* 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
* 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
* 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
* 08:56 yuvipanda: drained and deleted tools-webgrid-01
* 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
* 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
* 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
* 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
* 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
* 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
* 01:58 yuvipanda: increased tools instance quota
 
=== May 1 ===
* 03:55 YuviKTM: depooled and deleted tools-exec-20
* 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node
 
=== April 30 ===
* 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
* 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
* 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
* 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
* 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
* 05:40 YuviKTM: pooled in tools-exec-121{1-9}
* 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
* 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
* 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
* 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
* 05:39 YuviKTM: delete tools-exec-10, was out of jobs
* 04:28 YuviKTM: deleted tools-exec-09
* 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
* 04:23 YuviKTM: repooled tools-exec-1201 is all good now
* 04:19 YuviKTM: rejuggle jobs again in trustyland
* 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
* 04:08 YuviKTM: depooled tools-exec-09, apt troubles
* 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
* 04:00 YuviKTM: pooled tools-exec-1406 and 1407
* 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
* 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
* 03:53 YuviKTM: depooled tools-exec-03 / 04
* 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
* 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
* 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
* 03:18 YuviKTM: pooled tools-exec-1403, 1404
* 03:13 YuviKTM: pooled tools-exec-1402
* 03:07 YuviKTM: pooled tools-exec-1405
* 03:04 YuviKTM: pooled tools-exec-1401
* 02:53 YuviKTM: created tools-exec-14{06-10}
* 02:14 YuviKTM: created tools-exec-14{01-05}
* 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod
 
=== April 29 ===
* 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: [[Nova_Resource:I-00000bca.eqiad.wmflabs]]
* 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
* 19:28 YuviPanda: recreated tools-static-02
* 19:11 YuviPanda: failed over tools-static to tools-static-01
* 14:47 andrewbogott: deleting tools-exec-04
* 14:44 Coren: -exec-04 drained; removed from queues.  Rest well, old friend.
* 14:41 Coren: disabled -exec-04 (going away)
* 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
* 02:27 YuviPanda: created tools-exec-12{01-10}
 
=== April 28 ===
* 21:41 andrewbogott: shrinking tools-master
* 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
* 21:32 andrewbogott: shrinking tools-redis
* 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
* 21:27 andrewbogott: shrinking tools-submit
* 21:21 YuviPanda: backup crontabs onto NFS
* 21:18 andrewbogott: shrinking  tools-webproxy-02
* 21:14 andrewbogott: shrinking  tools-static-01
* 21:11 andrewbogott: shrinking tools-exec-gift
* 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
* 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
* 21:01 YuviPanda: failover tools-static to tools-static-02
* 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
* 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
* 20:39 valhallasw`cloud: created tools-mailrelay-01 [[Nova_Resource:I-00000bac.eqiad.wmflabs]]
* 20:26 YuviPanda: failed over tools-services to services-01
* 18:11 Coren: reenabled -webgrid-generic-02
* 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
* 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
* 14:04 Coren: reenable -exec-11 for jobs.
* 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment
 
=== April 25 ===
* 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
* 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough
 
=== April 24 ===
* 16:29 Coren: repooled -exec-02, -08, -12
* 16:05 Coren: -exec-02, -08 and -12 draining
* 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
* 15:41 Coren: -exec-03 goes away for good.
* 15:31 Coren: draining -exec-03 to ease migration
* 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot
 
=== April 23 ===
* 22:41 YuviPanda: disabled *@tools-exec-09
* 22:40 YuviPanda: add tools-exec-09 back to @general
* 22:38 YuviPanda: take tools-exec-09 from @general group
* 20:53 YuviPanda: restart bigbrother
* 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
* 20:22 valhallasw`cloud: removed <code>10.68.16.4 tools-webproxy tools.wmflabs.org</code> from /etc/hosts
* 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
* 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs
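Editor's note: the 22:38-22:41 hostgroup entries above can also be done non-interactively; a minimal sketch (qconf -mhgrp is the interactive equivalent):
<pre>
# Remove a node from the @general hostgroup, add it back, and disable all of
# its queue instances (mirrors the 22:38-22:41 entries above):
sudo qconf -dattr hostgroup hostlist tools-exec-09.eqiad.wmflabs @general
sudo qconf -aattr hostgroup hostlist tools-exec-09.eqiad.wmflabs @general
sudo qmod -d '*@tools-exec-09.eqiad.wmflabs'
</pre>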
 
=== April 20 ===
* 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.
 
=== April 18 ===
* 20:09 YuviPanda:  sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again
* 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting
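Context for the 20:09 entry: Redis BGSAVE forks the whole process, and under the default heuristic overcommit policy that fork can fail on a large dataset; a minimal sketch (the sysctl.d filename is illustrative, not taken from this log):
<pre>
# Allow the kernel to overcommit memory so the redis BGSAVE fork succeeds:
sudo sysctl -w vm.overcommit_memory=1
# Illustrative way to persist the setting across reboots (normally puppetized):
echo 'vm.overcommit_memory = 1' | sudo tee /etc/sysctl.d/60-redis-overcommit.conf
</pre>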
 
=== April 17 ===
* 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02
 
=== April 16 ===
* 20:57 Coren: -webgrid-08 drained, rebooting
* 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
* 20:45 Coren: -webgrid-03 drained, rebooting
* 20:38 Coren: -webgrid-03 depooled
* 20:38 Coren: -webgrid-02 repooled
* 20:35 Coren: -webgrid-02 drained, rebooting
* 20:33 Coren: -webgrid-02 depooled
* 20:32 Coren: -webgrid-01 repooled
* 20:06 Coren: -webgrid-01 drained, rebooting.
* 19:56 Coren: depooling -webgrid-01 for reboot
* 14:37 Coren: rebooting -master
* 14:29 Coren: rebooting -mail
* 14:22 Coren: rebooting -shadow
* 14:22 Coren: -exec-15 repooled
* 14:19 Coren: -exec-15 drained, rebooting.
* 13:46 Coren: -exec-14 repooled.  That's it for general exec nodes.
* 13:44 Coren: -exec-14 drained, rebooting.
 
=== April 15 ===
* 21:06 Coren: -exec-10 repooled
* 20:55 Coren: -exec-10 drained, rebooting
* 20:49 Coren: -exec-07 repooled.
* 20:47 Coren: -exec-07 drained, rebooting
* 20:43 Coren: -exec-06 requeued
* 20:41 Coren: -exec-06 drained, rebooting
* 20:15 Coren: repool -exec-05
* 20:10 Coren: -exec-05 drained, rebooting.
* 19:56 Coren: -exec-04 repooled
* 19:52 Coren: -exec-04 drained, rebooting.
* 19:41 Coren: disabling new jobs on remaining (exec) precise instances
* 19:32 Coren: repool -exec-02
* 19:30 Coren: draining -exec-04
* 19:29 Coren: -exec-02 drained, rebooting
* 19:28 Coren: -exec-03 rebooted, requeing
* 19:26 Coren: -exec-03 drained, rebooting
* 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
* 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
* 18:40 Coren: tools-exec-01 drained of jobs; rebooting
* 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
* 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.
 
=== April 14 ===
* 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
* 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== April 13 ===
* 21:11 YuviPanda: restart portgranter on all webgrid nodes
 
=== April 12 ===
* 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 11 ===
* 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
* 02:15 YuviPanda: rebooted tools-submit, was not responding
 
=== April 10 ===
* 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
* 05:20 YuviPanda: delete the tomcat node finally :D
 
=== April 9 ===
* 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
* 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
* 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== April 8 ===
* 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
* 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
* 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.
 
=== April 7 ===
* 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== April 5 ===
* 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 4 ===
* 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
* 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
* 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
* 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).
 
=== April 3 ===
* 22:55 scfc_de: Removed empty cgi-bin directories.
* 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== April 2 ===
* 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
* 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
* 01:25 YuviPanda: created tools-bastion-02
 
=== April 1 ===
* 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).
 
=== March 31 ===
* 14:02 Coren: rebooting tools-submit
* 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
* 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
* 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources.  It can be restarted any time.
 
=== March 30 ===
* 22:53 Coren: resyncing project storage with rsync
* 22:40 Coren: reboot tools-login
* 22:30 Coren: also bastion2
* 22:28 Coren: reboot bastion1 so users can log in
* 21:49 Coren: rebooting dedicated exec nodes.
* 21:49 Coren: rebooting tools-submit
* 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== March 29 ===
* 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.
 
=== March 28 ===
* 19:42 YuviPanda: created tools-exec-20
 
=== March 26 ===
* 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 25 ===
* 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
 
=== March 24 ===
* 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
* 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 23 ===
* 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
* 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
* 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
* 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up
 
=== March 22 ===
* 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
* 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
* 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01
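The two qconf invocations above are complementary: -as registers a submit host (allowed to run qsub/qstat), -ah an admin host (allowed to change grid configuration); a minimal sketch for a single host:
<pre>
sudo qconf -as tools-bastion-01.eqiad.wmflabs   # submit host: may submit jobs
sudo qconf -ah tools-bastion-01.eqiad.wmflabs   # admin host: may run qconf against the master
</pre>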
 
=== March 21 ===
* 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
 
=== March 15 ===
* 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
 
=== March 13 ===
* 16:23 YuviPanda: cleaned out / on tools-trusty
 
=== March 11 ===
* 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
* 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
* 03:56 YuviPanda: restarted redis server, it had OOM-killed
 
=== March 9 ===
* 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
* 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
* 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
* 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).
 
=== March 7 ===
* 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.
 
=== March 6 ===
* 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
* 07:43 scfc_de: Deployed jobutils/misctools 1.4.
 
=== March 2 ===
* 09:53 YuviPanda: added ananthrk to project
* 08:41 YuviPanda: delete tools-uwsgi-01
* 08:11 YuviPanda: delete tools-uwsgi-02 because https://phabricator.wikimedia.org/T91065
 
=== March 1 ===
* 15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure
 
=== February 28 ===
* 07:51 YuviPanda: create tools-webgrid-07
* 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
* 01:00 Coren: Also That was -webgrid-05
* 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
 
=== February 27 ===
* 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
* 15:33 Coren: Switched back to -master.  I'm making a note here: great success.
* 15:27 Coren: Gridengine master failover test part three; killing the master with -9
* 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
* 15:10 YuviPanda: created tools-webgrid-generic-02
* 15:10 YuviPanda: increase instance quota to 64
* 15:10 Coren: Master restarted - test not successful.
* 14:50 Coren: testing gridengine master failover starting now
* 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well
 
=== February 24 ===
* 18:33 Coren: tools-submit not recovering well from outage, kicking it.
* 17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs
 
=== February 16 ===
* 02:31 scfc_de: rm -f /var/log/exim4/paniclog.
 
=== February 13 ===
* 18:01 Coren: tools-redis is dead, long live tools-redis
* 17:48 Coren: rebuilding tools-redis with moar ramz
* 17:38 legoktm: redis on tools-redis is OOMing?
* 17:26 marktraceur: restarting grrrit-wm because it's not behaving
 
=== February 1 ===
* 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
* 07:51 YuviPanda: cleared error state of stuck queues
* 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
* 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
* 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
* 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
* 04:10 YuviPanda: widar moved to trusty
* 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.
 
=== January 29 ===
* 17:26 YuviPanda: reschedule all tomcat jobs
 
=== January 27 ===
* 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo
 
=== January 19 ===
* 20:51 YuviPanda: because valhallasw is nice
* 10:34 YuviPanda: manually started tools-webgrid-generic-01
* 09:48 YuviPanda: restarted toold-webgrid-03
* 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
* 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).
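Background for the 08:42 entry: queue instances enter an error state ('E') after repeated job failures and stay that way until cleared by hand; a minimal sketch:
<pre>
qstat -f -explain E       # list queue instances with the reason for any error state
sudo qmod -cq '{continuous,mailq,task}@tools-exec-06.eqiad.wmflabs'   # clear it once the cause is fixed
</pre>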
 
=== January 16 ===
* 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
 
=== January 15 ===
* 22:10 YuviPanda: created instance tools-webgrid-generic-01
 
=== January 11 ===
* 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.
 
=== January 8 ===
* 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G
 
=== December 23 ===
* 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000
 
=== December 22 ===
* 07:43 YuviPanda: increased RAM and Cores quota for tools
 
=== December 19 ===
* 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is hand-hacked to remove stupid syntax errors that got merged.
* 12:00 YuviPanda|brb: created tools-static, static http server
* 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== December 17 ===
* 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get
 
=== December 12 ===
* 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.
 
=== December 11 ===
* 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.
 
=== December 8 ===
* 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy, since otherwise puppet fails because ec2id thinks we’re not in labs because hostname -d is empty because we set /etc/hosts to resolve IP directly to tools-webproxy
 
=== December 7 ===
* 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
* 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
 
=== December 2 ===
* 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
* 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 26 ===
* 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty
 
=== November 25 ===
* 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 24 ===
* 14:02 YuviPanda: rebooting tools-login, OOM'd
* 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 22 ===
* 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
 
=== November 17 ===
* 20:40 YuviPanda: cleaned out /tmp on tools-login
 
=== November 16 ===
* 21:31 matanya: back to normal
* 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"
 
=== November 15 ===
* 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda
 
=== November 14 ===
* 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
* 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
* 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).
 
=== November 13 ===
* 21:11 YuviPanda: disable puppet on tools-dev to check shinken
* 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
* 20:38 YuviPanda: didn't actually stop puppet, need more patches
* 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
* 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
* 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.
 
=== November 12 ===
* 22:07 StupidPanda: enabled puppet on tools-exec-07
* 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
* 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
* 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken
 
=== November 7 ===
* 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).
 
=== November 6 ===
* 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).
 
=== November 5 ===
* 19:15 mutante: exec nodes have p7zip-full now
* 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login
 
=== November 4 ===
* 19:50 mutante: - apt-get clean on tools-login, and gzipped some logs
 
=== November 1 ===
* 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
 
=== October 30 ===
* 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
* 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp
 
=== October 27 ===
* 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
* 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.
 
=== October 26 ===
* 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
* 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
* 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.
 
=== October 24 ===
* 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006
 
=== October 23 ===
* 22:55 Coren: reboot tools-shadow, upstart seems hosed
 
=== October 14 ===
* 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07
 
=== October 11 ===
* 15:31 andrewbogott: rebooting tools-master, stab in the dark
* 06:01 YuviPanda: restarted gridengine-master on tools-master
 
=== October 4 ===
* 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b
 
=== October 2 ===
* 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
* 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools
 
=== September 28 ===
* 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3
 
=== September 25 ===
* 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
 
=== September 17 ===
* 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap
 
=== September 15 ===
* 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work
 
=== September 13 ===
* 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy
 
=== September 12 ===
* 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase
 
=== September 8 ===
* 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)
 
=== September 5 ===
* 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
* 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
* 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"
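A minimal sketch of the 18:50 clean-up above, assuming the dynamic proxy keeps its routes in the local Redis on the webproxy host:
<pre>
redis-cli DEL prefix:geohack        # on tools-webproxy: drop the stale route
# then, as the geohack tool account:
webservice start
</pre>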
 
=== September 4 ===
* 19:47 lokal-profil: local-heritage Renamed two swedish tables
 
=== September 2 ===
* 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076
 
=== August 23 ===
* 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11  : before job")
 
=== August 21 ===
* 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils
 
=== August 15 ===
* 16:45 legoktm: fixed grrrit-wm
* 16:36 legoktm: restarting grrrit-wm
 
=== August 14 ===
* 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
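The same clean-up, spread out for readability (identical logic to the one-liner above):
<pre>
for JOBID in $(qstat -u '*' | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p'); do
    # only delete jobs that are stuck because LDAP could not resolve their user
    if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then
        qdel "$JOBID"
    fi
done
</pre>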
 
=== August 12 ===
* 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again
 
=== August 2 ===
* 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
* 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs
 
=== August 1 ===
* 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")
 
=== July 24 ===
* 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
* 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts
 
=== July 21 ===
* 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
* 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
* 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again
 
=== July 18 ===
* 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
* 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
* 12:58 scfc_de: Made tools-webgrid-03 a grid submit host
 
=== July 16 ===
* 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
* 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
* 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st
 
=== July 15 ===
* 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04
 
=== July 14 ===
* 20:40 andrewbogott: on tools-login
* 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update
 
=== July 13 ===
* 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
* 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore
 
=== July 12 ===
* 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
* 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
* 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation of reboot
* 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation of reboot
* 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug why the directory wasn't created later
 
=== July 11 ===
* 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts
 
=== July 10 ===
* 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
* 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
* 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup
 
=== July 9 ===
* 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
* 23:09 YuviPanda: created tools-exec-13 with precise
* 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
* 23:07 YuviPanda: created tools-exec-12
* 23:06 YuviPanda: created tools-exec-11
* 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
* 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
* 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
* 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
* 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
* 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
* 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
* 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
* 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08
 
=== July 8 ===
* 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
* 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
* 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
* 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
* 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
* 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
* 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
* 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")
 
=== July 6 ===
* 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged
 
=== July 5 ===
* 22:36 YuviPanda: cleared diamond archive logs on a bunch of machines, submitted patch to get rid of archive logs
* 22:17 YuviPanda: changed grid scheduling config, set weight_priority to 0.1 from 0.0 for https://bugzilla.wikimedia.org/show_bug.cgi?id=67555
 
=== July 4 ===
* 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
* 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond
 
=== July 3 ===
* 16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
* 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
* 14:37 Betacommand: replication for enwiki is halted current lag is at 9876
 
=== July 2 ===
* 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
* 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats
 
=== July 1 ===
* 23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
* 21:08 scfc_de: Reset queues in error state again
* 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
* 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
* 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
* 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
* 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
* 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
* 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
* 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
* 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*
 
=== June 30 ===
* 22:10 YuviPanda: ran webservice start for enwp10
* 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
* 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
* 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
* 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
* 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue
 
=== June 29 ===
* 19:45 scfc_de: magnustools: "webservice start"
* 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead
 
=== June 28 ===
* 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy
 
=== June 21 ===
* 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki
 
=== June 20 ===
* 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
* 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219
 
=== June 16 ===
* 23:50 scfc_de: Shut down diamond services and removed log files on all hosts
 
=== June 15 ===
* 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non preallocating version is 'not meant for production', so putting on hold for now
* 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
* 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
* 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)
 
=== June 13 ===
* 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. strange
 
=== June 10 ===
* 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
* 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var
 
=== June 3 ===
* 17:50 Betacommand: Brief network outage. source:  It's not clearly determined yet; we aborted the investigation to rollback and restore service. As far as we can tell, there is something subtly wrong with the switch configuration of LACP.
 
=== June 2 ===
* 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
* 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though.  iipsrv.fcgi however has TMPDIR set as planned.
 
=== May 27 ===
* 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
* 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
* 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
* 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log
 
=== May 25 ===
* 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors
 
=== May 23 ===
* 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
* 14:10 andrewbogott: applying  role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors
 
=== May 22 ===
* 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
* 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
* 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
* 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
* 01:12 scfc_de: tools-mail: /var is full
 
=== May 20 ===
* 18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues
 
=== May 16 ===
* 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)
 
=== May 14 ===
* 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
* 00:23 Betacommand: 503s related to bug 65179
 
=== May 13 ===
* 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
* 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503s
* 19:09 marktraceur: Restarted grrrit because it had a stupid nick
 
=== May 10 ===
* 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1
 
=== May 9 ===
* 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes
 
=== May 6 ===
* 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
* 17:53 scfc_de: Don't think replagstats is really working ...
* 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).
 
=== April 28 ===
* 11:51 YuviPanda: pywikibugs Deployed {{Gerrit|bf1be7b55a19457469f311ae54e1cf6409eb4a0b}}
 
=== April 27 ===
* 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1
 
=== April 24 ===
* 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not.
* 23:38 YuviPanda: restarted greg-g after cherry-picking {{Gerrit|aec09a6f669bc1806557576212aa218bfa520c35}} for auth of IRC bot
* 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/{{Gerrit|129610}}
* 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)
 
=== April 20 ===
* 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
* 14:08 scfc_de: tools-redis: /var is full
* 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
* 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
* 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)
 
=== April 13 ===
* 14:20 scfc_de: Restarted webservice for wikihistory to see if the change to PHP_FCGI_MAX_REQUESTS increases reliability
* 14:17 scfc_de: tools-webgrid-01, tools-webgrid-02: Set PHP_FCGI_MAX_REQUESTS to 500 in /usr/local/bin/lighttpd-starter per http://redmine.lighttpd.net/projects/1/wiki/docs_performancefastcgi#Why-is-my-PHP-application-returning-an-error-500-from-time-to-time
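What the 14:17 change amounts to inside the wrapper script (sketch; the surrounding script is not reproduced here):
<pre>
# In /usr/local/bin/lighttpd-starter: recycle each php-cgi worker after 500
# requests so leaked memory is reclaimed regularly.
export PHP_FCGI_MAX_REQUESTS=500
</pre>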
 
=== April 12 ===
* 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")
 
=== April 11 ===
* 16:21 scfc_de: tools-login: Killed -HUP process consuming 2.6 GByte; cf. [[wikitech:User talk:Ralgis#Welcome to Tool Labs]]
 
=== April 10 ===
* 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes
 
=== April 8 ===
* 05:06 Ryan_Lane: restart nginx on tools-proxy-test
* 05:03 Ryan_Lane: upgraded libssl on all nodes
 
=== April 4 ===
* 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
* 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request
 
=== March 30 ===
* 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
* 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier
 
=== March 29 ===
* 10:14 wm-bot: petrb: disabled 1 job in cron in -login of user tools.tools-info which was killing login server
 
=== March 28 ===
* 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
* 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
* 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login
 
=== March 21 ===
* 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
* 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mär 17 16:42", perhaps artifact of mass migration?)
* 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
* 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
* 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8
 
=== March 20 ===
* 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
 
=== March 5 ===
* 13:57 wm-bot: petrb: test
 
=== March 4 ===
* 22:35 wm-bot: petrb: uninstalling it from -login too
* 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there
 
=== March 3 ===
* 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make system useable and finish upgrade
* 19:17 wm-bot: petrb: upgrading all packages on webserver-02
* 19:15 petan: rebooting webserver-01 which is totally dead
* 19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
* 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
* 16:44 scfc_de: tools-webserver-03: Apache was swamped by requests for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
* 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
* 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
* 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.
 
=== March 1 ===
* 03:42 Coren: disabled puppet in pmtpa tool labs
 
=== February 28 ===
* 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
* 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"
 
=== February 27 ===
* 15:28 scfc_de: chmod g-w ~fsainsbu/.forward
 
=== February 25 ===
* 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.
 
=== February 23 ===
* 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC
 
=== February 21 ===
* 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
* 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either
 
=== February 20 ===
* 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at [[User talk:Reza#Running bots on tools-login, etc.]] ([[:fa:بحث_کاربر:Reza1615]] is write-protected)
* 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at [[:ko:사용자토론:ChongDae#Running bots on tools-login, etc.]]
* 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
* 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
* 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
 
=== February 19 ===
* 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
* 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127
 
=== February 18 ===
* 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
* 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
* 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
* 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
 
=== February 14 ===
* 23:54 legoktm: restarting grrrit-wm since it disappeared
* 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
 
=== February 13 ===
* 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
* 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
* 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help
 
=== February 12 ===
* 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc
 
=== February 11 ===
* 15:47 scfc_de: Restarted webservice for geohack
* 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
* 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05
 
=== February 10 ===
* 23:16 Coren: rebooting webproxy (braindead autofs)
 
=== February 9 ===
* 18:14 legoktm: restarting grrrit-wm, it keeps joining and quitting
* 04:27 legoktm: rebooting grrrit-wm - https://gerrit.wikimedia.org/r/#/c/112308
 
=== February 6 ===
* 22:50 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/111889
 
=== February 4 ===