Release Engineering/SAL
Revision as of 06:42, 13 February 2016
2016-02-13
- 06:42 bd808: restarted nutcracker on deployment-mediawiki01
- 06:32 bd808: jobrunner on deployment-jobrunner01 enabled after reverting changes from T87928 that caused T126830
- 05:51 bd808: disabled jobrunner process on jobrunner01; queue full of jobs broken by T126830
- 05:31 bd808: trebuchet clone of /srv/jobrunner/jobrunner broken on jobrunner01; failing puppet runs
- 05:25 bd808: jobrunner process on deployment-jobrunner01 badly broken; investigating
- 05:20 bd808: Ran https://phabricator.wikimedia.org/P2273 on deployment-jobrunner01.deployment-prep.eqiad.wmflabs; freed ~500M; disk utilization still at 94%
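The paste P2273 is not reproduced here. A minimal sketch of a cleanup pass of that general kind (the paths and retention window are assumptions, not the paste's actual contents):
    df -h /                                              # utilization before
    sudo find /var/log -name '*.gz' -mtime +14 -delete   # prune old rotated logs
    df -h /                                              # utilization after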
2016-02-12
- 23:54 hashar: beta cluster broken since 20:30 UTC https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor haven't looked yet
- 17:36 hashar: salt -v '*slave-trusty*' cmd.run 'apt-get -y install texlive-generic-extra' # T126422
- 17:32 hashar: adding texlive-generic-extra on CI slaves by cherry picking https://gerrit.wikimedia.org/r/#/c/270322/ - T126422
- 17:19 hashar: getting rid of integration-dev; it is broken somehow
- 17:10 hashar: Nodepool back to spawning instances. contintcloud has been migrated to wmflabs
- 16:51 thcipriani: running sudo salt '*' -b '10%' deploy.fixurl to fix deployment-prep trebuchet urls
- 16:31 hashar: bd808 added support for saltbot to update tasks automagically!!!! T108720
- 03:10 yurik: attempted to sync graphoid from gerrit 270166 from deployment-tin, but it wouldn't sync. Tried to git pull on sca02, but the submodules wouldn't pull (see the sketch below)
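A minimal sketch of the usual recovery when submodules refuse to pull, assuming a plain git checkout (the path is a placeholder; not necessarily what yurik ran):
    cd /srv/deployment/graphoid/graphoid     # assumed checkout location
    git fetch origin && git reset --hard origin/master
    git submodule sync --recursive           # re-point submodule URLs at .gitmodules
    git submodule update --init --recursive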
2016-02-11
- 22:53 thcipriani: shutting down deployment-bastion
- 21:28 hashar: pooling back slaves 1001 to 1006
- 21:18 hashar: re-enabling hhvm service on slaves (https://phabricator.wikimedia.org/T126594). Some symlink is missing and is only provided by the upstart script, grrrrrrr https://phabricator.wikimedia.org/T126658
- 20:52 legoktm: deploying https://gerrit.wikimedia.org/r/270098
- 20:35 hashar: depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file
- 20:29 hashar: pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006
- 20:14 hashar: pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003
- 19:35 marxarelli: modifying deployment server node in jenkins to point to deployment-tin
- 19:27 thcipriani: running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt
- 19:27 twentyafterfour: Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537
- 19:24 thcipriani: moving deployment-bastion to deployment-tin
- 17:59 hashar: recreated instances with proper names: integration-slave-trusty-{1001-1006}
- 17:52 hashar: Created integration-slave-trusty-{1019-1026} as m1.large (note: 1023 is an exception; it is for Android). Applied role::ci::slave; let's wait for puppet to finish
- 17:42 Krinkle: Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs
- 17:27 hashar: Depooling all the ci.medium slaves and deleting them.
- 17:27 hashar: I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-(
- 16:00 hashar: rebuilding integration-dev https://phabricator.wikimedia.org/T126613
- 15:27 Krinkle: Deploy Zuul config change https://gerrit.wikimedia.org/r/269976
- 11:46 hashar: salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help with the failing Wikidata browser tests
- 11:32 hashar: disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches )
- 10:50 hashar: reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB)
- 10:16 hashar: pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken)
- 10:06 hashar: disabling puppet on all CI slaves. Trying to lower tmpfs from 512MB to 128MB (https://gerrit.wikimedia.org/r/#/c/269880/); see the sketch at the end of this day's entries
- 02:45 legoktm: deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893
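For the tmpfs shrink mentioned above, a minimal sketch assuming the tmpfs is mounted at /var/lib/jenkins-slave/tmpfs (the mount point is an assumption; the real change went through the cherry-picked puppet patch):
    sudo mount -o remount,size=128M /var/lib/jenkins-slave/tmpfs   # shrink in place, no reboot
    df -h /var/lib/jenkins-slave/tmpfs                             # confirm the new size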
2016-02-10
- 23:54 hashar_: depooling Trusty slaves that only have 2GB of RAM, which is not enough. https://phabricator.wikimedia.org/T126545
- 22:55 hashar_: gallium: find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete ( https://phabricator.wikimedia.org/T126552 )
- 22:34 Krinkle: Zuul is back up and processing Gerrit events, but jobs are still queued indefinitely. Jenkins is not accepting new jobs
- 22:31 Krinkle: Full restart of Zuul. Seems Gearman/Zuul got stuck. All executors were idling. No new Gerrit events processed either.
- 21:22 legoktm: cherry-picking https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster again
- 21:17 hashar: The CI dust has settled. Krinkle and I have pooled a lot more Trusty slaves to accommodate the overload caused by switching to php55 (jobs run on Trusty)
- 21:08 hashar: pooling trusty slaves 1009, 1010, 1021, 1022 with 2 executors (they are ci.medium)
- 20:38 hashar: cancelling mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs manually. They will catch up on next merge
- 20:34 Krinkle: Pooled integration-slave-trusty-1019 (new)
- 20:28 Krinkle: Pooled integration-slave-trusty-1020 (new)
- 20:24 Krinkle: created integration-slave-trusty-1019 and integration-slave-trusty-1020 (ci1.medium)
- 20:18 hashar: created integration-slave-trusty-1009 and 1010 (trusty ci.medium)
- 20:06 hashar: creating integration-slave-trusty-1021 and integration-slave-trusty-1022 (ci.medium)
- 19:48 greg-g: that cleanup was done by apergos
- 19:48 greg-g: did cleanup across all integration slaves, some were very close to out of room. results: https://phabricator.wikimedia.org/P2587
- 19:43 hashar: Dropping Precise m1.large slaves integration-slave-precise-1014 and integration-slave-precise-1013; most load shifted to Trusty (php53 -> php55 transition)
- 18:20 Krinkle: Creating a Trusty slave to support increased demand following the MediaWiki php53 (Precise) -> php55 (Trusty) bump
- 16:06 jzerebecki: reloading zuul for 41a92d5..5b971d1
- 15:42 jzerebecki: reloading zuul for 639dd40..41a92d5
- 14:12 jzerebecki: recover a bit of disk space: integration-saltmaster:~# salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*WikibaseQuality*'
- 13:46 jzerebecki: reloading zuul for 639dd40
- 13:15 jzerebecki: reloading zuul for 3be81c1..e8e0615
- 08:07 legoktm: deploying https://gerrit.wikimedia.org/r/269619
- 08:03 legoktm: deploying https://gerrit.wikimedia.org/r/269613 and https://gerrit.wikimedia.org/r/269618
- 06:41 legoktm: deploying https://gerrit.wikimedia.org/r/269607
- 06:34 legoktm: deploying https://gerrit.wikimedia.org/r/269605
- 02:59 legoktm: deleting 14GB broken workspace of mediawiki-core-php53lint from integration-slave-precise-1004
- 02:37 legoktm: deleting /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer on trusty-1017, it had a skin cloned into it
- 02:26 legoktm: queuing mwext jobs server-side to identify failing ones (see the sketch at the end of this day's entries)
- 02:21 legoktm: deploying https://gerrit.wikimedia.org/r/269582
- 01:03 legoktm: deploying https://gerrit.wikimedia.org/r/269576
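For the 02:26 server-side queuing, one plausible mechanism is Jenkins' remote build-trigger API; a sketch only (USER:TOKEN and the job list are placeholders, and this is not necessarily how legoktm did it):
    for JOB in mwext-testextension-hhvm mwext-testextension-zend; do
      curl -X POST --user 'USER:TOKEN' "https://integration.wikimedia.org/ci/job/$JOB/build"
    done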
2016-02-09
- 23:17 legoktm: deploying https://gerrit.wikimedia.org/r/269551
- 23:02 legoktm: gracefully restarting zuul
- 22:57 legoktm: deploying https://gerrit.wikimedia.org/r/269547
- 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/269540
- 22:18 legoktm: re-enabling puppet on all CI slaves
- 22:02 legoktm: reloading zuul to see if it'll pickup the new composer-php53 job
- 21:53 legoktm: enabling puppet on just integration-slave-trusty-1012
- 21:52 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ onto integration-puppetmaster
- 21:50 legoktm: disabling puppet on all trusty/precise CI slaves
- 21:40 legoktm: deploying https://gerrit.wikimedia.org/r/269533
- 17:49 marxarelli: disabled/enabled gearman in jenkins, connection works this time
- 17:49 marxarelli: performed stop/start of zuul on gallium to restore zuul and gearman
- 17:45 marxarelli: "Failed: Unable to Connect" in jenkins when testing gearman connection
- 17:40 marxarelli: killed the old zuul process manually and restarted the service (see the restart sketch at the end of this day's entries)
- 17:39 marxarelli: restart of zuul fails as well. old process cannot be killed
- 17:38 marxarelli: reloading zuul fails with "failed to kill 13660: Operation not permitted"
- 16:06 bd808: Deleted corrupt integration-slave-precise-1003:/mnt/jenkins-workspace/workspace/mediawiki-core-php53lint/.git
- 15:11 hashar: mira: /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.13 php-1.27.0-wmf.13
- 14:51 hashar: ./make-wmf-branch -n 1.27.0-wmf.13 -o master
- 14:50 hashar: pooling back integration-slave-precise-1001 to 1004. Manually fetched git repos in workspace for mediawiki core php53
- 14:49 hashar: make-wmf-branch instance: created a local ssh key pair and set the config to use User: hashar
- 14:13 hashar: pooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ MySQL is back. Blame puppet
- 14:12 hashar: depooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ MySQL is gone somehow
- 14:04 hashar: Manually git fetching mediawiki-core in /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint of slaves precise 1001 to 1004 (git on Precise is remarkably slow)
- 13:28 hashar: salt '*trusty*' cmd.run 'update-alternatives --set php /usr/bin/hhvm'
- 13:28 hashar: salt '*precise*' cmd.run 'update-alternatives --set php /usr/bin/php5'
- 13:18 hashar: salt -v --batch=3 '*slave*' cmd.run 'puppet agent -tv'
- 13:15 hashar: removing https://gerrit.wikimedia.org/r/#/c/269370/ from CI puppet master
- 13:14 hashar: slaves recurse infinitely, running /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh and then looping over /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin https://phabricator.wikimedia.org/T126327
- 12:46 hashar: Mass testing php loop of death: salt -v '*slave*' cmd.run 'timeout 2s /srv/deployment/integration/slave-scripts/bin/php --version'
- 12:40 hashar: mass rebooting CI slaves from wikitech
- 12:39 hashar: salt -v '*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
- 12:33 hashar: all slaves dying due to PHP looping
- 12:02 legoktm: re-enabling puppet on all trusty/precise slaves
- 11:20 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster
- 11:20 legoktm: enabling puppet just on integration-slave-trusty-1012
- 11:13 legoktm: disabling puppet on all *(trusty|precise)* slaves
- 10:26 hashar: pooling in integration-slave-trusty-1018
- 03:19 legoktm: deploying https://gerrit.wikimedia.org/r/269359
- 02:53 legoktm: deploying https://gerrit.wikimedia.org/r/238988
- 00:39 hashar: gallium: edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and changed replication_timeout = 300 -> replication_timeout = 10
- 00:37 hashar: live hacking Zuul code to have it stop sleeping() on force merge
- 00:36 hashar: killing zuul
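The 17:38-17:49 sequence above amounts to forcing a wedged Zuul down and up by hand; a minimal sketch assuming the stock init script on gallium (paths and service name are assumptions):
    sudo /etc/init.d/zuul stop || true   # the reload had failed with "Operation not permitted"
    sudo pkill -9 -f zuul-server         # kill the stale process directly
    sudo /etc/init.d/zuul start
    # then toggle "Enable Gearman" off/on in Jenkins' global configuration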
2016-02-08
- 23:48 legoktm: finally deploying https://gerrit.wikimedia.org/r/269327
- 23:14 hashar: zuul promote --pipeline gate-and-submit --changes 269065,2 https://gerrit.wikimedia.org/r/#/c/269065/
- 23:10 hashar: pooling integration-slave-precise-1001 1002 1004
- 22:47 hashar: Err, need to reboot newly provisioned instances before adding them to Jenkins (kernel upgrade, apache restart, etc.)
- 22:45 hashar: Pooled https://integration.wikimedia.org/ci/computer/integration-slave-precise-1003/
- 22:25 hashar: integration-slave-precise-{1001-1004}: applied role::ci::slave::labs, running puppet on the slaves. I have added the instances as Jenkins slaves and put them offline. Once puppet is done, we can mark them online in Jenkins and then monitor that the jobs running on them work properly
- 22:15 hashar: Provisioning integration-slave-precise-{1001-1004} https://phabricator.wikimedia.org/T126274 (need more php53 slots)
- 22:13 hashar: Deleted cache-rsync instance superseded by castor instance
- 22:10 hashar: Deleting pmcache.integration.eqiad.wmflabs (was to investigate various kind of central caches).
- 20:14 marxarelli: aborting pending mediawiki-extensions-php53 job for CheckUser
- 20:08 bd808: toggled "Enable Gearman" off and on in Jenkins to wake up deployment-bastion workers
- 14:54 hashar: nodepool: refreshed snapshot image; Image ci-jessie-wikimedia-1454942958 in wmflabs-eqiad is ready (see the sketch at the end of this day's entries)
- 14:47 hashar: regenerated nodepool reference image (got rid of grunt-cli https://gerrit.wikimedia.org/r/269126 )
- 09:41 legoktm: deploying https://gerrit.wikimedia.org/r/269093 https://gerrit.wikimedia.org/r/269094
- 09:36 hashar: restarting integration puppetmaster (out of memory / cannot fork)
- 06:11 bd808: tgr set $wgAuthenticationTokenVersion on beta cluster (test run for T124440)
- 02:09 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268047
- 00:57 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268031
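A sketch of the 14:47-14:54 image refresh, assuming the nodepool CLI of that era (subcommand names are an assumption, recalled from upstream nodepool rather than taken from the log):
    nodepool image-update wmflabs-eqiad ci-jessie-wikimedia   # rebuild base image + snapshot
    nodepool image-list                                       # wait for state: ready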
2016-02-06
- 18:34 jzerebecki: reloading zuul for bdb2ed4..46ccca9
2016-02-05
- 13:30 hashar: beta: cleaning out /data/project/logs/archive; it dates from the pre-logstash era. Apparently we have not logged this way since May 2015
- 13:29 hashar: beta: deleting /data/project/swift-disk, created in August 2014 and unused since June 2015. It was a failed attempt at bringing Swift to beta
- 13:27 hashar: beta: reclaiming disk space from extensions.git. On bastion: find /srv/mediawiki-staging/php-master/extensions/.git/modules -maxdepth 1 -type d -print -execdir git gc \;
- 13:03 hashar: integration-slave-trusty-1011 ran out of disk space. Did some brute-force cleanup and git gc (see the sketch at the end of this day's entries)
- 05:21 Tim: configured mediawiki-extensions-qunit to only run on integration-slave-trusty-1017, did a rebuild and then switched it back
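The 13:03 brute cleanup is unspecified; a sketch of the usual approach on a slave (the workspace path appears elsewhere in this log; the job name is a placeholder):
    sudo du -sh /mnt/jenkins-workspace/workspace/* | sort -rh | head -20   # biggest offenders
    cd /mnt/jenkins-workspace/workspace/some-job/src
    git gc --aggressive --prune=now                                        # repack a bloated repo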
2016-02-04
- 22:08 jzerebecki: reloading zuul for bed7be1..f57b7e2
- 21:51 hashar: salt-key -d integration-slave-jessie-1001.eqiad.wmflabs
- 21:50 hashar: salt-key -d integration-slave-precise-1011.eqiad.wmflabs
- 00:57 bd808: Got deployment-bastion processing Jenkins jobs again via instructions left by my past self at https://phabricator.wikimedia.org/T72597#747925
- 00:43 bd808: Jenkins agent on deployment-bastion.eqiad doing the trick where it doesn't pick up jobs again
2016-02-03
- 22:24 bd808: Manually ran sync-common on deployment-jobrunner01.eqiad.wmflabs to pickup wmf-config changes that were missing (InitializeSettings, Wikibase, mobile)
- 17:43 marxarelli: Reloading Zuul to deploy previously undeployed Icd349069ec53980ece2ce2d8df5ee481ff44d5d0 and Ib18fe48fe771a3fe381ff4b8c7ee2afb9ebb59e4
- 15:12 hashar: apt-get upgrade deployment-sentry2
- 15:03 hashar: redeployed rcstream/rcstream on deployment-stream by using git-deploy on deployment-bastion
- 14:55 hashar: upgrading deployment-stream
- 14:42 hashar: pooled back integration-slave-trusty-1015 Seems ok
- 14:35 hashar: manually triggered a bunch of browser tests jobs
- 11:40 hashar: apt-get upgrade deployment-ms-be01 and deployment-ms-be02
- 11:32 hashar: fixing puppet.conf on deployment-memc04 (see the puppet.conf sketch at the end of this day's entries)
- 11:09 hashar: restarting beta cluster puppetmaster just in case
- 11:07 hashar: beta: apt-get upgrade on deployment-cache* hosts and checking puppet
- 10:59 hashar: integration/beta: deleting /etc/apt/apt.conf.d/*proxy files. There is no need for them; in fact, the web proxy is not reachable from labs
- 10:53 hashar: integration: switched puppet repo back to 'production' branch, rebased.
- 10:49 hashar: various beta cluster hosts have puppet errors...
- 10:46 hashar: integration-slave-trusty-1013 heading toward full disk on /mnt ...
- 10:42 hashar: integration-slave-trusty-1016 out of disk space on /mnt ...
- 03:45 bd808: Puppet failing on deployment-fluorine with "Error: Could not set uid on user[datasets]: Execution of '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists"
- 03:44 bd808: Freed 28G by deleting deployment-fluorine:/srv/mw-log/archive/*2015*
- 03:42 bd808: Ran deployment-bastion.deployment-prep:/home/bd808/cleanup-var-crap.sh and freed 565M
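"Fixing puppet.conf" in these entries means pointing an instance's certname/server at the fully qualified .deployment-prep. name; a sketch assuming the puppet 3 layout (the hostname and exact key are assumptions):
    sudo sed -i 's/^certname = .*/certname = deployment-memc04.deployment-prep.eqiad.wmflabs/' /etc/puppet/puppet.conf
    sudo puppet agent -tv   # re-run the agent to confirm the fix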
2016-02-02
- 18:32 marxarelli: Reloading Zuul to deploy If1f3cb60f4ccb2c1bca112900dbada03a8588370
- 17:42 marxarelli: cleaning mwext-donationinterfacecore125-testextension-php53 workspace on integration-slave-precise-1013
- 17:06 ostriches: running sync-common on mw2051 and mw1119
- 09:38 hashar: Jenkins is fully up and operational
- 09:33 hashar: restarting Jenkins
- 08:47 hashar: pooling back integration-slave-precise-1011; the puppet run got fixed (https://phabricator.wikimedia.org/T125474)
- 03:48 legoktm: deploying https://gerrit.wikimedia.org/r/267828
- 03:29 legoktm: deploying https://gerrit.wikimedia.org/r/266941
- 00:42 legoktm: due to T125474
- 00:42 legoktm: marked integration-slave-precise-1011 as offline
- 00:39 legoktm: precise-1011 slave hasn't had a puppet run in 6 days
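A quick way to confirm how stale puppet is on a slave (a sketch; the state-file path is the puppet 3 default, an assumption here):
    stat -c '%y' /var/lib/puppet/state/last_run_summary.yaml   # last completed run
    sudo puppet agent -tv                                      # run by hand to surface the error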
2016-02-01
- 23:53 bd808: Logstash working again; I applied a change to the default mapping template for Elasticsearch that ensures that fields named "timestamp" are indexed as plain strings (see the sketch at the end of this day's entries)
- 23:46 bd808: Elasticsearch index template for beta logstash cluster making crappy guesses about syslog events; dropped 2016-02-01 index; trying to fix default mappings
- 23:09 bd808: HHVM logs causing rejections during document parse when inserting in Elasticsearch from logstash. They contain a "timestamp" field that looks like "Feb 1 22:56:39" which is making the mapper in Elasticsearch sad.
- 23:04 bd808: Elasticsearch on deployment-logstash2 rejecting all documents with 400 status. Investigating
- 22:50 bd808: Copying deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log to /srv for debugging later
- 22:48 bd808: deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log is 11G of fail!
- 22:46 bd808: root partition on deployment-logstash2 full
- 22:43 bd808: No data in logstash since 2016-01-30T06:55:37.838Z; investigating
- 15:33 hashar: Image ci-jessie-wikimedia-1454339883 in wmflabs-eqiad is ready
- 15:01 hashar: Refreshing Nodepool image. Might have npm/grunt properly set up
- 03:15 legoktm: deploying https://gerrit.wikimedia.org/r/267630
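A sketch of the kind of template change described at 23:53, assuming the Elasticsearch 1.x _template API (the index pattern and mapping are assumptions, not bd808's exact change):
    curl -XPUT 'http://localhost:9200/_template/string-timestamp' -d '{
      "template": "logstash-*",
      "mappings": {
        "_default_": {
          "properties": { "timestamp": { "type": "string", "index": "not_analyzed" } }
        }
      }
    }'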
2016-01-31
- 13:35 hashar: Jenkins IRC bot started failing at Jan 30 01:04:00 2016 for whatever reason.... Should be fine now
- 13:33 hashar: cancelling/aborting jobs that are stuck while reporting to IRC (mostly browser tests and beta cluster jobs)
- 13:32 hashar: Jenkins jobs are being blocked because they can no more report back to IRC :-(((
- 13:28 hashar: Jenkins jobs are being blocked because they can no more report back to IRC :-(((
2016-01-30
- 12:46 hashar: integration-slave-jessie-1001: fixed the puppet.conf server name and ran puppet
2016-01-29
- 18:43 thcipriani: updated scap on beta
- 16:44 thcipriani: deployed scap updates on beta
- 11:58 _joe_: upgraded hhvm to 3.6 wm8 in deployment-prep
2016-01-28
- 23:22 MaxSem: Updated portals on betalabs to master
- 22:23 hashar: salt '*slave-precise*' cmd.run 'apt-get install php5-ldap' ( https://phabricator.wikimedia.org/T124613 ) will need to be puppetized
- 18:17 thcipriani: cleaning npm cache on slave machines: salt -v '*slave*' cmd.run 'sudo -i -u jenkins-deploy -- npm cache clean'
- 18:12 thcipriani: running npm cache clean on integration-slave-precise-1011 sudo -i -u jenkins-deploy -- npm cache clean
- 15:25 hashar: apt-get upgrade deployment-sca01 and deployment-sca02
- 15:09 hashar: fixing puppet.conf hostname on deployment-upload deployment-conftool deployment-tmh01 deployment-zookeeper01 and deployment-urldownloader
- 15:06 hashar: fixing puppet.conf hostname on deployment-upload.deployment-prep.eqiad.wmflabs and running puppet
- 15:00 hashar: Running puppet on deployment-memc02 and deployment-elastic07 . It is catching up with lot of changes
- 14:59 hashar: fixing puppet hostnames on deployment-elastic07
- 14:59 hashar: fixing puppet hostnames on deployment-memc02
- 14:55 hashar: Deleted salt keys deployment-pdf01.eqiad.wmflabs and deployment-memc04.eqiad.wmflabs (obsolete, entries with '.deployment-prep.' are already there)
- 07:38 jzerebecki: reload zuul for 4951444..43a030b
- 05:55 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
- 03:49 mobrovac: deployment-prep re-enabled puppet on deployment-restbase0x
- 02:49 mobrovac: deployment-prep deployment-restbase01 disabled puppet to set up cassandra for
- 02:27 mobrovac: deployment-prep recreating deployment-restbase01 for T125003
- 02:23 mobrovac: deployment-prep deployment-restbase02 disabled puppet to recreate deployment-restbase01 for T125003
- 01:42 mobrovac: deployment-prep recreating deployment-sca02 for T125003
- 01:28 mobrovac: deployment-prep recreating deployment-sca01 for T125003
- 00:36 mobrovac: deployment-prep re-imaging deployment-mathoid for T125003
- 00:02 jzerebecki: integration-slave-trusty-1016:~$ sudo -i rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/Donate
2016-01-27
- 23:49 jzerebecki: integration-slave-precise-1011:~$ sudo -i /etc/init.d/salt-minion restart
- 23:46 jzerebecki: work around https://phabricator.wikimedia.org/T117710 : salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
- 21:19 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf (should be no-op after yesterday's deploy)
- 10:29 hashar: triggered a bunch of browser tests; deployment-redis01 was dead/faulty
- 10:08 hashar: mass restarting redis-server process on deployment-redis01 (for https://phabricator.wikimedia.org/T124677 )
- 10:07 hashar: mass restarting redis-server process on deployment-redis01
- 09:00 hashar: beta: commenting out "latency-monitor-threshold 100" parameter from any /etc/redis/redis.conf we have ( https://phabricator.wikimedia.org/T124677 ). Puppet will not reapply it unless distribution is Jessie
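A minimal sketch of that comment-out across the beta redis hosts (the salt target and sed expression are assumptions):
    salt -v '*redis*' cmd.run "sed -i 's/^latency-monitor-threshold 100/# latency-monitor-threshold 100/' /etc/redis/redis.conf"
    salt -v '*redis*' cmd.run 'service redis-server restart'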
2016-01-26
- 16:51 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
- 12:14 hashar: Added Jenkins IRC bot (wmf-insecte) to #wikimedia-perf for https://gerrit.wikimedia.org/r/#/c/265631/
- 09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
- 04:18 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build (27 hours after the last time I did that)
2016-01-25
- 18:59 twentyafterfour: started redis-server on deployment-redis01 by commenting out latency-monitor-threshold from the redis.conf
- 15:22 hashar: CI: fixing kernels not upgrading via: rm /boot/grub/menu.lst ; update-grub -y (i.e.: regenerate the Grub menu from scratch)
- 14:21 hashar: integration-slave-trusty-1015.integration.eqiad.wmflabs is gone. I failed the kernel upgrade / grub update
- 01:35 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build
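The recurring /mnt/pbuilder/build cleanup could look like this (the one-day retention is an assumption, not the exact command bd808 ran):
    sudo find /mnt/pbuilder/build -mindepth 1 -maxdepth 1 -mtime +1 -exec rm -rf {} +
    df -h /mnt   # confirm the space was reclaimed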
2016-01-24
- 06:45 legoktm: deploying https://gerrit.wikimedia.org/r/266039
- 06:13 legoktm: deploying https://gerrit.wikimedia.org/r/266041
2016-01-22
- 23:58 legoktm: removed skins from mwext-qunit workspace on trusty-1013 slave
- 23:34 legoktm: rm -rf /mnt/jenkins-workspace/workspace/mediawiki-phpunit-php53 on slave precise 1012
- 22:45 legoktm: deploying https://gerrit.wikimedia.org/r/265864
- 22:27 hashar: rebooted all CI slaves using OpenStackManager
- 22:09 hashar: rebooting deployment-redis01 (kernel upgrade)
- 21:22 hashar: Image ci-jessie-wikimedia-1453497269 in wmflabs-eqiad is ready (with node 4.2 for https://phabricator.wikimedia.org/T119143 )
- 21:14 hashar: updating nodepool snapshot based on new image
- 21:12 hashar: rebuilding nodepool reference image
- 20:04 hashar: Image ci-jessie-wikimedia-1453492820 in wmflabs-eqiad is ready
- 20:00 hashar: Refreshing nodepool image to hopefully get Nodejs 4.2.4 https://phabricator.wikimedia.org/T124447 https://gerrit.wikimedia.org/r/#/c/265802/
- 16:32 hashar: Nuked corrupted git repo on integration-slave-precise-1012 /mnt/jenkins-workspace/workspace/mediawiki-extensions-php53
- 12:23 hashar: beta: reinitialized keyholder on deployment-bastion; the proxy apparently had no identity (see the sketch at the end of this day's entries)
- 09:32 hashar: beta cluster Jenkins jobs have been stalled for 9 hours and 25 minutes. Disabling/re-enabling the Gearman plugin to remove the deadlock
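Reinitializing keyholder presumably means re-arming its shared ssh-agent; a sketch using WMF's keyholder tool (subcommands as recalled, an assumption):
    sudo keyholder status   # proxy up, but no identities loaded
    sudo keyholder arm      # prompts for the key passphrase(s)
    sudo keyholder status   # identities should now be listed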
2016-01-21
- 21:41 hashar: restored role::mail::mx on deployment-mx
- 21:36 hashar: dropping role::mail::mx from deployment-mx to let puppet run
- 21:33 hashar: rebooting deployment-jobrunner01 / kernel upgrade / /tmp is only 1 MB
- 21:19 hashar: fixing up deployment-jobrunner01; the /tmp and / disks are full
- 19:57 thcipriani: ran REPAIR TABLE globalnames; on centralauth db
- 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/265552
- 19:39 legoktm: deploying jjb changes for https://gerrit.wikimedia.org/r/264990
- 19:25 legoktm: deploying https://gerrit.wikimedia.org/r/265546
- 01:59 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions/SpellingDictionary$ rm -r modules/jquery.uls && git rm modules/jquery.uls
- 01:00 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git pull && git submodule update --init --recursive
- 00:57 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git reset HEAD SpellingDictionary
2016-01-20
- 20:05 hashar: beta sudo find /data/project/upload7/math -type f -delete (probably some old left over)
- 19:50 hashar: beta: on commons ran deleteArchivedFile.php : Nuked 7130 files
- 19:49 hashar: beta : foreachwiki deleteArchivedRevisions.php -delete
- 19:26 hasharAway: Nuked all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload
- 19:19 hasharAway: beta: sudo find /data/project/upload7/*/*/temp -type f -delete
- 19:14 hasharAway: beta: sudo rm /data/project/upload7/*/*/lockdir/*
- 18:57 hasharAway: beta cluster code has been stalled for roughly 2h30
- 18:55 hasharAway: disconnecting Gearman plugin to remove the deadlock for beta cluster jobs
- 17:06 hashar: clearing files from beta-cluster to prepare for Swift migration. python pwb.py delete.py -family:betacommons -lang:en -cat:'GWToolset Batch Upload' -verbose -putthrottle:0 -summary:'Clearing out old batched upload to save up disk space for Swift migration'
2016-01-19
- 22:25 legoktm: deleting *zend* workspaces on precise slaves
- 21:58 thcipriani: trying https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update again
- 21:57 thcipriani: beta-scap-eqiad still can't find executor on deployment-bastion.eqiad
- 21:52 thcipriani: following steps at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update for deployment-bastion
- 19:34 legoktm: deleting all *zend* jobs from jenkins
- 09:40 hashar: Created github repo https://github.com/wikimedia/operations-debs-varnish4
- 03:59 legoktm: deploying https://gerrit.wikimedia.org/r/264912 and https://gerrit.wikimedia.org/r/264922
2016-01-17
- 18:02 legoktm: deploying https://gerrit.wikimedia.org/r/264605
2016-01-16
- 21:47 legoktm: deploying https://gerrit.wikimedia.org/r/264489
- 21:36 legoktm: deploying https://gerrit.wikimedia.org/r/264488
- 21:29 legoktm: deploying https://gerrit.wikimedia.org/r/264487
- 21:21 legoktm: deploying https://gerrit.wikimedia.org/r/264483 https://gerrit.wikimedia.org/r/264485
- 20:58 legoktm: deploying https://gerrit.wikimedia.org/r/264492
- 18:55 jzerebecki: reloading zuul for 996c558..5f8eb50
- 09:12 legoktm: deploying https://gerrit.wikimedia.org/r/264448
- 09:01 legoktm: deploying https://gerrit.wikimedia.org/r/264446 and https://gerrit.wikimedia.org/r/264447
- 07:46 legoktm: sudo -u jenkins-deploy mv /mnt/jenkins-workspace/workspace/mediawiki-core-phplint /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint on all precise slaves
- 07:17 legoktm: deploying https://gerrit.wikimedia.org/r/264444
- 06:31 legoktm: deploying https://gerrit.wikimedia.org/r/264441
- 06:10 legoktm: added phpflavor-php53 label to all phpflavor-zend slaves
2016-01-15
- 12:17 hashar: restarting Jenkins for plugins updates
- 02:49 bd808: Trying to fix submodules in deployment-bastion:/srv/mediawiki-staging/php-master/extensions for T123701
2016-01-14
- 20:06 legoktm: deploying https://gerrit.wikimedia.org/r/264122
- 19:32 legoktm: deploying https://gerrit.wikimedia.org/r/264114
- 19:18 legoktm: deploying https://gerrit.wikimedia.org/r/264108
2016-01-13
- 21:06 hashar: beta cluster code is up to date again. Got delayed by roughly 4 hours.
- 20:55 hashar: unlocked Jenkins jobs for beta cluster by disabling/reenabling Jenkins Gearman client
- 10:15 hashar: beta: fixed puppet on deployment-elastic06. It was still using a cert/hostname without .deployment-prep. Mass update occurring.
2016-01-12
- 23:30 legoktm: deploying https://gerrit.wikimedia.org/r/263757 https://gerrit.wikimedia.org/r/263756
- 13:32 hashar: beta cluster: running /usr/local/sbin/cleanup-pam-config
- 13:29 hashar: integration running /usr/local/sbin/cleanup-pam-config on slaves
2016-01-11
- 22:24 hashar: Deleting old references on Zuul-merger for mediawiki/core : /usr/share/python/zuul/bin/python /home/hashar/zuul-clear-refs.py --until 15 /srv/ssd/zuul/git/mediawiki/core
- 22:21 hashar: gallium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
- 22:21 hashar: scandium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
- 07:35 legoktm: deploying https://gerrit.wikimedia.org/r/263319
2016-01-07
- 23:16 legoktm: deleted /mnt/jenkins-workspace/workspace/mediawiki-extensions-qunit/src/extensions/PdfHandler/.git/refs/heads/wmf/1.26wmf16.lock on slave 1013
- 06:32 legoktm: deploying https://gerrit.wikimedia.org/r/262868
- 02:24 legoktm: deploying https://gerrit.wikimedia.org/r/262855
- 01:25 jzerebecki: reloading zuul for b0a5335..c16368a
2016-01-06
- 21:06 thcipriani: kicking integration puppetmaster; weird issue where it was unable to find the node definition
- 21:11 jzerebecki: on scandium: sudo -u zuul rm -rf /srv/ssd/zuul/git/mediawiki/services/mathoid
- 21:04 legoktm: ^ on gallium
- 21:04 legoktm: manually deleted /srv/ssd/zuul/git/mediawiki/services/mathoid to force zuul to re-clone it
- 20:17 hashar: beta: dropped a few more /etc/apt/apt.conf.d/*-proxy files; the web proxy is no longer reachable from labs
- 09:44 hashar: CI/beta: deleting all git tags from /var/lib/git/operations/puppet and doing git repack (see the sketch at the end of this day's entries)
- 09:39 hashar: restoring puppet hacks on beta cluster puppetmaster.
- 09:35 hashar: beta/CI: salt -v '*' cmd.run 'rm -v /etc/apt/apt.conf.d/*-proxy' https://phabricator.wikimedia.org/T122953
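A sketch of the 09:44 tag purge and repack (the exact flags are assumptions):
    cd /var/lib/git/operations/puppet
    git tag | xargs -r git tag -d   # delete every local tag
    git repack -a -d                # repack to actually reclaim the space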
2016-01-05
- 16:54 hashar_: Removed Elasticsearch from CI slaves https://phabricator.wikimedia.org/T89083 https://gerrit.wikimedia.org/r/#/c/259301/
- 03:45 Krinkle: integration-slave-trusty-1015: rm -rf /mnt/home/jenkins-deploy/.npm per https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/56577/console
2016-01-04
- 21:06 hashar: gallium has puppet enabled again
- 20:53 hashar: stopping puppet on gallium and live hacking Zuul configuration for https://phabricator.wikimedia.org/T122656
2016-01-02
- 03:17 yurik: purged varnish caches on deployment-cache-text04
2016-01-01
- 22:17 bd808: No nodepool ci-jessie-* hosts seen in Jenkins interface and rake-jessie jobs backing up
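A first diagnostic step for that state, assuming the nodepool CLI on the nodepool host (subcommands are an assumption):
    nodepool list         # nodes nodepool believes exist, with their states
    nodepool alien-list   # instances the provider has that nodepool doesn't know about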
Archive
- Archive 1 (September 2014 - December 2015)