
Server admin log/test



October 18

  • 00:03 Brion: trying a log thingy

October 17

  • 21:10 brion: enabled Commons foreign image repo on Wikitech
  • 18:45 brion: created Wikimedia-Boston list for SJ
  • 16:55 brion: adding nomcomwiki to special.dblist so it shows up right in sitematrix
  • 16:45 brion: deleted some junk comments from bugzilla
  • 16:31 brion: changed autoconfirm settings for 'fishbowl' wikis -- 0 age for autoconfirm, plus set upload & move for all users just in case autoconfirm doesn't kick in right (settings sketched below)
  • 14:22 RobH: srv131 back up.
  • 09:03 Tim: copying srv129 and srv139 ES data directories to storage2:/export/backup
  • 02:49 Tim: excessive lag on db16, killed long-running queries and temporarily depooled. CUPS odyssey continues.
  • 01:59 Tim: removing cups on all servers where it is running
  • 00:00 RobH: restarted srv43-47
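  The 16:31 autoconfirm change above comes down to a few standard MediaWiki settings; a minimal sketch as it might appear in the wiki's PHP configuration (values assumed from the entry, not the exact production file):

      // Let new accounts autoconfirm immediately...
      $wgAutoConfirmAge = 0;
      // ...and grant upload & move to all registered users as a fallback,
      // in case autoconfirm doesn't kick in right away.
      $wgGroupPermissions['user']['upload'] = true;
      $wgGroupPermissions['user']['move']   = true;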

October 16

  • 20:42 brion: added 3 more dump threads on srv31... we need to find some more batch servers to work with for the time being until new dump system is in place :)
  • 20:20 RobH: pulled samuel from the rack, decommissioned, RIP samuel.
  • 19:35 RobH: migrated rack B4 from asw3 to asw-b4-pmtpa.
  • 18:40 RobH: rebooted scs-ext, oops!
  • 18:26 RobH: srv61 reinstalled and redeployed.
  • 18:24 RobH: Adler re-racked with rails, booted up to maintenance mode prompt.
  • 17:34 mark: 208.80.152.0/25 NTP restriction is actually also not broad enough - changed it to /22 in ntpd.conf on zwinger (see the sketch below)
  • 17:02 brion: thumbnails on commons are insanely slow and/or broken
  • 14:44 Tim: added a more comprehensive redirection list to squid.conf.php for storage1 images
  • 14:04 Tim: redirected images for /wikipedia/en/ to storage1, apparently they were moved a while ago. Refactored the relevant squid.conf section.
  • 13:38 Tim: disabled directory index on amane. Was generating massive amounts of NFS traffic by generating a directory index for some timeline directories.
  • 12:51 Tim: increased memory limit on srv159 to 8x200MB. Still well under physical.
  • 11:38 Tim: cleaned up temporary files on srv159, had filled its disk
  • 11:25 Tim: synced upload scripts (including to ms1)
  • 10:06 Tim: removed sq50 from the squid node lists and uninstalled squid on it
  • 09:22 - 09:52 mark, Tim, JeLuF: initial attempts to bring the squids back up failed due to incorrect permissions on the recreated swap logs. Most were back up by around 09:32, except newer knams and yaseo squids which were missing from the squids_global node group. The node group was updated and the remainder of the squids brought up around 09:52.
  • 09:19 JeLuF: deployed squid.conf with an error in it. All squid instances exited.
  • 08:26 Tim: Restarted ntpd on search7, was broken
  • 06:42 Tim: ntp.conf on zwinger had the wrong netmask for the 208.x net, it was /26 instead of /25. So a lot of squids were out of it, and some had a clock skew of 10 minutes (as visible on ganglia). Fixed ntp.conf, not stepped yet. Will affect squid logs.
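  The two ntpd.conf entries above (06:42 and 17:34) are about the netmask on the restrict line for the 208.x network; a sketch in generic ntpd syntax (the nomodify/notrap flags are assumptions, not zwinger's exact line):

      # /26 (wrong): only covers 208.80.152.0 - 208.80.152.63
      restrict 208.80.152.0 mask 255.255.255.192 nomodify notrap
      # /25 (first fix): covers 208.80.152.0 - 208.80.152.127, still too narrow
      restrict 208.80.152.0 mask 255.255.255.128 nomodify notrap
      # /22 (final): covers 208.80.152.0 - 208.80.155.255
      restrict 208.80.152.0 mask 255.255.252.0 nomodify notrap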

October 15

  • 19:49 brion: added '<span onmouseover="_tipon' to the spam regex; some kind of weird edit submissions coming in with this stuff, like [1] (see the config sketch below)
  • 12:00 Tim: trying to bring srv159 up as an image scaler. Limiting memory usage to 8x100 = 800MB with MediaWiki.
  • 11:21 srv127 died just the same. Mark suggests using one with DRAC next.
  • 10:20 Tim: all image scalers (srv43 and srv100) swapped to death again. Preparing srv127 as an image scaler with swap off.
  • 08:43 Tim: reduced depool-threshold for the scalers to 0.1 since srv100 is quite capable of handling the load by itself while we're waiting for the other servers to come back up.
  • 07:45 Tim: half the scaling cluster went down again, ganglia shows high system CPU. Installing wikimedia-task-scaler on srv100.
  • 02:30 Tim: moved image scalers into their own ganglia cluster
  • 02:17 Tim: apache on srv43-47 hadn't been restarted and so was still running without -DSCALER. This partially explains the swapping. Restarted them. Took srv38-39 back out of the image scaler pool, they have different rsvg and ffmpeg binary paths and break without a MediaWiki reconfiguration.
  • 02:13 tomasz: upgraded srv9 to ubuntu 8.04
  • 02:00 tomasz: upgraded srv9 to ubuntu 7.10
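  The 19:49 spam-regex change above corresponds to something like the following in the MediaWiki configuration; a minimal sketch using the standard $wgSpamRegex setting (the production regex likely contains other patterns as well):

      // Reject edits containing the injected tooltip markup seen in the weird submissions.
      $wgSpamRegex = '/<span onmouseover="_tipon/i';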

October 14

  • 19:16 brion: restarted lighty on storage1 again -- it was back in 'fastcgi overloaded' mode, possibly due to the previously broken backend, possibly not
  • 19:11 mark: Pooled old scaling servers srv38, srv39
  • 18:50 brion: at least four of the new image scalers are down -- can't reach them by SSH. thumbnailing is borked
  • 16:41 brion: fixed image scaling for now -- storage1 fastcgi backends were overloaded, so it was rejecting things. did some killall -9s to shut them all down and restarted lighty (roughly as sketched below). ok so far
  • 16:20 brion: image scaling is broken in some way, investigating
  • 02:54 Tim: fixed srv43-47, this is now the image scaling cluster
  • 00:10 Tim: oops, forgot to add VIPs, switched back.
  • 00:05 Tim: switched image scaling LVS to srv43-47
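  The 16:41 fix above amounts to something like this on storage1 (a sketch; the backend process name and init script path are assumptions):

      # forcibly kill the wedged FastCGI backends, then restart lighttpd
      killall -9 php-cgi              # assumed backend process name
      /etc/init.d/lighttpd restart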

October 13

  • 23:45 Tim: prepping srv43-47 as image scaling servers
  • 21:45 jeluf: moved more image directories to ms1. Now, upload/wikipedia/[abghijmnopqrstuwxy]* are on ms1
  • 21:35 jeluf: killed mwsearchd on srv39, removed both the rc3.d link and the cronjob that start mwsearchd
  • 21:30 RobH: search8 and search9 are online, awaiting configuration.
  • 21:15 brion: thumb rendering failures reported... found some runaway convert procs poking at an animated GIF, killed them.
    • rev:42058 will force GIFs over 1 megapixel to render a single frame instead of animations as a quick hackaround... (config sketched below)
  • 20:48 domas: thistle serving as s2a server
  • 20:28 RobH: stopping mysql on adler so it can be re-racked with rails.
  • 19:53 RobH: search7 back online, awaiting addition to the search cluster.
  • 19:35 mark: Set up an Exim instance on srv9 for outgoing donation mail, as well as incoming for delivery into IMAP for CiviMail (*spit*).
  • 17:00 RobH: srv21-srv29 decommissioned and unracked.
  • 12:05 domas: put lomaria back in rotation
  • 11:50 domas: Enabled write-behind caching on db15. Restarted.
  • 10:40 domas: restarted replication on db15 and lomaria
  • 10:27 domas: loading dewiki data from SQL dump into thistle
  • 09:09 Tim: restarted logmsgbot
  • 08:27 Tim: folded s2b back into s2
  • 08:06 Tim: db13 in rotation
  • 08:02 domas: copying from db15 to lomaria
  • 07:38 Tim: started replication on db13
  • 04:51 Tim: copying
  • 03:27 Tim: Preparing for copy from db15 to db13
  • 00:00 domas: something wrong with db15 i/o performance. it is behaving way worse than it should.
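  The rev:42058 hackaround noted under the 21:15 entry above caps the size of animated GIFs in the MediaWiki config; a sketch assuming the setting from that revision with a one-megapixel limit:

      // Animated GIFs whose area (width * height) exceeds this many pixels are
      // thumbnailed as a single frame instead of an animation.
      $wgMaxAnimatedGifArea = 1000 * 1000;   // 1 megapixel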

October 12

  • 23:58 brion: updated CodeReview to add a commit so the loadbalancer saves our master position. playing with the serverstatus extension on yongle to find out why it keeps getting stuck
  • 22:05 brion: db15 sucks hard. putting categories back to db13
  • 22:01 brion: db15 got all laggy with the load. taking back out of general rotation, leaving it on categories/recentchangeslinked
  • 21:58 brion: db15 seems all happy. swapping it in in place of db13, and giving it some general load on s2. we'll have to resync db13 at some point? and toolserver?
  • 19:41 Tim: shutting down db15 for restart with innodb_flush_log_at_trx_commit=2 (see the my.cnf sketch below). But db8 seems to be handling the load now so I'm going to bed.
  • 19:20 Tim: depooled db15.
  • 19:09 Tim: split off some wikis into s2b and put db8 on it. To reduce I/O and hopefully stop the lag.
  • 18:51 Tim: db15 still chronically lagged. Offloading all s2 RCL and category queries to db13.
  • 18:38 Tim: offloading commons RCL queries to db13
  • 18:36 Tim: dewiki r/w with ixia (master) only
  • 18:33 Tim: offloading commons category queries to db13
  • 18:25 Tim: balancing load. Fixed ganglia on various mysql servers.
  • 18:06 Tim: going to r/w on s2. Not s2a yet because db15/db8 can't handle the load.
  • 17:46 Tim: db8->db15 copy finished, deploying
  • 17:33 Tim: installed NRPE on thistle.
  • 16:54 Tim: copied mysqld binaries from db11 to db15 and thistle. Plan for thistle is to use it for s2a.
  • 16:40 Tim: ixia/db8 can't handle the load between them with db13 out, even with s2a diverted. Restored db13 to the pool. Running out of candidates for a copy destination. Need db13 in because it's keeping the site up, can't copy to thistle because it's too small with RAID 10. Plan B: set up virgin server db15. Copying from db8.
  • 16:07 Tim: repooled ixia/db8 r/o
  • 15:53 Tim: removed ixia binlogs 290-349. 270-289 were deleted during the initial response.
  • 14:54 mark: Pooled search6 as part of search cluster 2, by request of rainman
  • 14:37 Tim: deployed r41995 as a live patch to replace buggy temp hack.
  • 14:14 Tim: cleaned up binlogs on db2. Yes the horse has bolted, but we may as well shut the gate.
  • 14:11 Tim: copy now in progress as planned.
  • 13:48 Tim: going to try the resync option. Maybe with s2 it won't take as long as s1. Will try to sync up db8 from ixia with db13 serving read-only load for the duration of the copy.
  • 13:40 Tim: ixia (s2 master) disk full. Classic scenario, binlogs stopped first, writing continued for 10 minutes before replag was reported.
  • 13:00 jeluf: moved wikipedia/m* image directories to ms1
  • 08:00 jeluf: restarted lighttpd on ms1, directory listings are now disabled.
  • 02:55 Tim: attempted to disable directory listing on ms1. Gave up after a while.
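  The innodb_flush_log_at_trx_commit=2 restart in the 19:41 entry is the standard InnoDB durability/performance trade-off; as a my.cnf line (generic MySQL syntax, not db15's actual config file):

      [mysqld]
      # 2 = write the log at each commit but fsync it only about once per second,
      # trading a little durability for much less I/O on a struggling slave
      innodb_flush_log_at_trx_commit = 2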

October 11

  • 7:00 jeluf: moved wikipedia/s* image directories to ms1

October 10

  • 21:30 jeluf: moved wikipedia/[jqtuwxy]* to ms1
  • 19:20 RobH: Bayes online.
  • 19:11 brion: recreated special page update logs in /home/wikipedia/logs, hopefully fixing special page updates
  • 13:05 Tim: reverted live patch and merged properly tested fix r41928 instead.
  • 12:31 Tim: deployed a live patch to fix a regression in MessageCache::loadFromDB() concurrency limiting lock
  • 12:17 domas: killed long running threads
  • ~12:04: s2 down due to slave server overload

October 9

  • 22:52 brion: enabled Collection on de.wikibooks so they can try it out
  • 20:00 jeluf: moved wikipedia/i* images to ms1
  • 17:05 RobH: thistle's RAID died due to a failed HDD; replaced the HDD and reinstalled as RAID 10.
  • 12:00 domas: switched s3 master to db1, erased a bunch of db.php stuff by accident (don't know how :), restored it from db.php~ :-)
  • 09:31 mark: pascal died yet again, revived it. Will move the htcp proxy tonight...

October 8

  • 21:05 brion: yongle still gets stuck from time to time, breaking mobile, apple search, and svn-proxy. i still suspect svn-proxy but can't easily prove it. using a separate svn command (in theory) but it's not showing me stuck processes.
  • ??:?? rob fixed srv37, then later srv133, by adding them to the mediawiki-installation node group. he did an audit and didn't see any other problems. i ran a scap to make sure all are now up to date
    • Speculation: possible that rumored ongoing image disappearances have been caused by the image-destruction bug still being in place on srv133 for the last month.
  • 19:02 mark: Upgraded packages on search1 - search6 and searchidx1
  • 18:59 brion: aaron complaining of srv37 not properly updated (doesn't recognize Special:RatingHistory). flaggedrevs.php was out of date there. checking scap infrastructure, stuff seems ok so far...

October 7

  • 21:47 brion: started two dump threads (srv31)
  • 21:16 RobH: installed and configured gmond on all knams squids.
  • 21:00 jeluf: moved wikipedia/g* to ms1
  • 18:55 RobH: fixed private uploads issue for arbcom-en and wikimaniateam.
  • 17:26 RobH: reinstalled and redeployed knsq24 and knsq29
  • 15:00-16:00 robert: switched enwiki to lucene-search 2.1 running on new servers. Test run till tomorrow; if anything goes wrong, reroute search_pool_1 to the old searchers on lvs3. Will switch on spell checking when all of the servers are racked. Thanks RobH for tuning the config files.
  • 15:54 RobH: srv101 crashed again, running tests.
  • 15:45 RobH: srv146 was powered down for no reason. Powered back up.
  • 15:42 RobH: srv138 locked up, rebooted, back online.
  • 15:32 RobH: srv110 was locked up, rebooted, synced, back online.
  • 15:31 RobH: srv101 back up and synced.
  • 15:22 RobH: rebooted srv56, was locked up, handed off to rainman to finish repair.
  • 15:21 RobH: updated lucene.php and synced.
  • 15:04 RobH: updated memcached to remove srv110 and add in spare srv137 (see the sketch below).
  • 15:00 RobH: removed all servers from lvs:search_pool_1 and put in search1 and search2 with rainman
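  The 15:04 memcached change is an edit to the server list in mc-pmtpa.php; a hypothetical sketch of that kind of change, assuming the standard $wgMemCachedServers array (the addresses below are made up):

      $wgMemCachedServers = array(
          // '10.0.2.110:11000',   // srv110 - locked up, pulled from the pool (hypothetical address)
          '10.0.2.137:11000',      // srv137 - spare swapped in for srv110 (hypothetical address)
          // ...rest of the pool unchanged...
      );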

October 6

  • 23:55 brion: tweaked bugzilla to point rXXXX at CodeReview instead of ViewVC
  • 14:29 domas: amane lighty was closing connections immediately, worked properly after restart. upgraded to 1.4.20 on the way.
  • 14:36 RobH: set up ganglia on all pmtpa squids.
  • 13:50 mark: The slow page loading on the frontend squids appears to be limited to english main page only, for unknown reasons. Set another article as pybal check URL to prevent pooling/depooling oscillation by PyBal for now.
  • 09:27 mark: yaseo squids are fully in swap, set DNS scenario yaseo-down

October 5

  • 23:14 mark: Frontend squids are not working well at the moment, sometimes serving cached objects with very high delays. I wonder if they are under (socket) memory pressure. Reduced cache_mem on the backend instance on sq25 to free up some memory for testing.
  • 20:35 jeluf: wikipedia/b* moved, too
  • 19:00 jeluf: switched squids to send requests for upload.wikimedia.org/wikipedia/a* to ms1 (see the squid.conf sketch below)
  • 14:30 jeluf: Moving all wikipedia/a* image directories to ms1
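  The 19:00 change above routes part of the upload tree to a different origin server; a minimal sketch in generic squid.conf terms (the peer name, port and ACL names are assumptions; production generates this from squid.conf.php):

      # send upload.wikimedia.org/wikipedia/a* (later b*, etc.) to ms1 instead of the old image store
      acl upload_dom dstdomain upload.wikimedia.org
      acl ms1_paths urlpath_regex ^/wikipedia/a
      cache_peer ms1.wikimedia.org parent 80 0 no-query originserver name=ms1
      cache_peer_access ms1 allow upload_dom ms1_paths
      cache_peer_access ms1 deny all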

October 4

  • 23:17 mark: Repooled knsq16-30 frontends in LVS. Also found that mint was fighting with fuchsia about being LVS master, due to reboot this afternoon.
  • 14:30 mark: Several servers in J-16 were shutting down or going down around this time. Reason unknown; possibly an automatic shutdown because of high temperature, possibly they were turned off by someone locally.
  • 14:03 mark: SARA power failure. Feed B lost power for ~ 6 seconds.
  • 00:26 mark: Depooled srv61
  • 00:07 brion: found srv37 and srv61 have broken json_decode (wtf!); a quick check is sketched below
    • updating packages on srv37. srv61 seems to have internal auth breakage
    • updated packages on srv61 too. su still borked, may need LDAP fix or something?
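  A quick way to confirm the kind of breakage in the 00:07 entry is a one-liner on the affected host (a sketch, not the actual command that was run):

      # should print the decoded array; NULL or a fatal error means json support is broken
      php -r 'var_dump(json_decode("{\"ok\": 1}", true));'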

October 3

  • 21:40 brion: transferring old upload backups from storage2 to storage3. once complete, can restart dumps!
  • 20:01 brion: running updateRestrictions on all wikis (done)
  • 17:51 RobH: srv135 & srv136 reinstalled as ubuntu.
  • 17:34 RobH: srv132 & srv133 reinstalled as ubuntu.
  • 17:13 RobH: srv130 back online.
  • 16:40 RobH: depooled srv131, srv132, srv135, srv136 for reinstall.
  • 00:25 brion: switched codereview-proxy.wikimedia.org to use local SVN command instead of PECL SVN module; it seemed to be getting bogged down with diffs, but hard to really say for sure

October 1

  • 20:02 RobH: srv63 back online.
  • 19:35 RobH: srv61 and srv133 back online.
  • 18:22 RobH: storage3 online and handed off to brion.
  • 17:35 RobH: updated mc-pmtpa.php to put srv61 as spare.
  • 17:32 RobH: srv61 faulty fan replaced, back online.
  • 09:31 Tim: srv104 (cluster18) hit max_rows, finally. Removed it from the write list.
  • 08:36 Tim: fixed ipb_allow_usertalk default on all wikis
  • 23:46 mark: Reinstalled knsq24
  • 22:55 mark: Reenabled switchports of knsq16 - knsq30
  • 20:45 jeluf: fixed resolv.conf on srv131
  • 20:45 jeluf: mounted ms1:/export/upload as /mnt/upload5 (see the mount sketch below), started lighttpd on ms1
  • 19:47 brion: enabled revision deletion on test.wikipedia.org for some public testing.
  • 14:25 RobH: Cleaned out the squid cache on knsq16, knsq17, knsq18, knsq19, knsq21, knsq22, knsq23, knsq25, knsq26, knsq27, knsq28, knsq30. DRAC not responsive on knsq20, knsq24, knsq29.
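  The 20:45 mount above is a plain NFS mount of the new image store; roughly (mount options and any matching /etc/fstab entry are assumptions):

      mount -t nfs ms1:/export/upload /mnt/upload5
      # equivalent /etc/fstab line (assumed options):
      # ms1:/export/upload   /mnt/upload5   nfs   defaults   0 0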
