You are browsing a read-only backup copy of Wikitech. The live site can be found at

Server admin log/2008-08: Difference between revisions

From Wikitech-static
== August 31 ==
* 23:05 mark: A parser bug in the PowerDNS Bind backend caused unavailability of the zone for a few minutes, ouch...
* 22:55 mark: Deployed a PowerDNS pipebackend instance with [ this script] on ([[lily]]) only. Just one out of three nameservers for stability testing for now. Should there be major trouble, remove all "pipe" backend references from <tt>/etc/powerdns/pdns.conf</tt>.
* 18:38 Tim: Going to bed. Status is: srv107 replicating but locked with slow alter table. Can be re-added after it catches up. cluster18 is working, for no apparent reason, and should be migrated to max_rows=20M ASAP. cluster17 needs a master switch so that srv102 can be fixed, after that it should be re-added to the write list. Once srv142 is done copying, it can be restarted and repooled, as can srv145. No need to fix the replication there since it's an old cluster.
* 18:30 Tim: re-adding cluster19 to the write list, without srv107 which is still altering.
* 16:22 Tim: srv141 didn't work out, out of disk space, trying copy to srv142 instead (from srv145)
* 14:44 Tim: srv103 and srv110 done, repooling.
* 14:02 Tim: srv108 done, changed master to srv108, started max_rows change on srv107
* 13:51 Tim: started max_rows change on srv110. Not patient enough to do them one at a time.
* 13:38 Tim: copy to srv110 finished. Put srv110 in, srv103 left out for now for max_rows change
* 13:27 Tim: taking srv145 out of rotation for copy to new ext store srv141 (has same partitioning)
* 12:45 Tim: srv109 finished, starting on srv108
* 11:45 Tim: taking srv103 out of rotation for copy to new ext store srv110
* 11:37 Tim: alter table blobs max_rows=10000000; on srv109.
* 11:34 Tim: cluster is too much of a mongrel undocumented mess to set up new ext store servers, and we don't have that many candidates left anyway. Going to try saving the existing clusters.
* 10:27 Tim: received reports that cluster19 has gone the same way. Most likely all slaves and masters set up that time are affected and will fail roughly simultaneously. Will set up new clusters.
* 10:15 Tim: set mysql root password on external storage servers where it was blank
* 10:07 Tim: cluster17 master srv102 has stopped being writable for enwiki due to exhausted MyISAM index table size (max_rows=1000000). Removed from write list, working on it.
* 07:00 Tim: On srv189: added to sources.list. Installed debug symbols for apache.
== August 30 ==
* 22:11 mark: Set up an experimental IPv6 to IPv4 proxy on [[iris]]
* 17:13 Tim: killed long-running convert processes on srv152-189
== August 29 ==
* 21:00 jeluf: checked srv104, added it back to its ES pool, added cluster18 back to wgDefaultExternalStore
* 16:12 RobH: moved [[srv52]] and [[srv56]] from B2 to C4 for heat issues.
* 15:32 RobH: [[srv149]] reinstalled as apache core.
* 13:08 Tim: images on kuwiki were actually broken because the move from amane to storage2 failed. The directory on amane was probably recreated by the thumbnail handler before the migration script created the symlink, resulting in a new writable image directory with no images in it. Merged the two directories and fixed the symlink.
* 12:00 domas: did space cleanups on amaryllis, and all DBs (all <80% disk usage now :) - preparing for vacation. VACATION!!! :)
== August 28 ==
* 22:50 mark: Set up a dirty, temporary test setup of PyBal on [[lvs2]] doing SSH logins on all apaches for health checking.
* 21:43 RobH: reinstalled [[srv134]] back online as apache core.
* 21:10 RobH: reinstalled [[srv130]] back online as apache core.
* 20:09 RobH: [[searchidx1]], [[search1]], [[search2]], [[search3]], [[search4]], [[search5]], [[search6]], & [[search7]] racked with remote management enabled.
* 16:09 RobH: [[db9]] reinstalled for misc db role.
* 13:28 Tim: removed dkwiktionary and dkwikibooks from all.dblist. Apparently they're visible on the web when they were previously removed. They were created accidentally years ago due to dk being an alias for da.
** They became visible due to Rob's changes to langlist.
* 05:59 Tim: Following complaint about bad uploads on kuwiki, running "find -type d -not -perm 777 -exec chmod 777 {} \;" in various upload directories with various maxdepth options.
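The permission sweep in the 05:59 entry can be approximated in Python. This is only a sketch of what the quoted <tt>find</tt> command does, not the tooling that was actually used; the function name <tt>open_up_directories</tt> and its parameters are hypothetical.

```python
import os
import stat


def open_up_directories(root, max_depth=None, mode=0o777):
    # Rough equivalent of:
    #   find ROOT -maxdepth N -type d -not -perm 777 -exec chmod 777 {} \;
    # Returns the directories whose mode was changed.
    # Caveat: os.walk silently skips directories the process cannot list,
    # so run this with enough privileges to read the whole tree.
    changed = []
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, _filenames in os.walk(root):
        depth = dirpath.count(os.sep) - base_depth
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # like -maxdepth: do not descend any further
        # S_IMODE keeps only the permission bits, as -perm compares them.
        current = stat.S_IMODE(os.lstat(dirpath).st_mode)
        if current != mode:
            os.chmod(dirpath, mode)
            changed.append(dirpath)
    return changed
```

Symlinked directories are not followed or changed, which matches <tt>-type d</tt>. A call such as <tt>open_up_directories("/path/to/uploads", max_depth=2)</tt> (path hypothetical) corresponds to one of the "various maxdepth options" runs.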
== August 27 ==
* 22:57 RobH: [[srv127]] reinstalled and back online as apache.
* 22:34 RobH: [[srv36]] reinstalled and back online as apache.
* 22:09 RobH: [[srv117]] reinstalled and back online as apache.
* 22:00 mark: Commented out most LVS related checks in <tt>/home/wikipedia/bin/apache-sanity-check</tt> which are no longer relevant
* 22:00 mark: Various changes to the Ubuntu installer, to make SM apache installs work, and for preseeding of NTP config.
* 21:48 RobH: [[srv81]] reinstalled and back online as apache.
* 19:07 RobH: Purged redirect from all knams squids.
* 18:10 RobH: [[srv147]] reinstalled and deployed as apache.
* 16:30 RobH: [[sq48]] had a possible issue with hdc.  Tested fine, cleaned and back online.
* 15:19 RobH: [[srv146]] was read-only.  Rebooted, fsck, restarted.
* 08:38 Tim: added FlaggedRevs stats update to crontab on hume
* 08:03 Tim: running FlaggedRevs/maintenance/updateLinks.php on dewiki
== August 26 ==
* 20:00 RobH: moved [[srv84]] and [[srv85]] from B4 to B3 rack.
* 18:39 RobH: moved [[srv82]] and [[srv83]] from B4 to B3 rack.
* 17:30 RobH: [[srv81]] reinstalled and running apache.  Needs ext store setup.
* 16:35 RobH: [[srv103]] restarted and synced.
* 16:01 brion: [[srv103]] serving pages with stale software but unreachable. needs to be shut down
* 14:53 RobH: reinstalled [[db10]] for misc. db tasks.
* 13:27 Tim: disabled some user account on otrs-wiki
* 11:15 mark: Added [[coronelli]] to search pool 3 on [[lvs3]]
* 00:26 RobH: fixed my own typo in redirects.conf, pushed, graceful all apache.
* 00:15 RobH: pushed some fixes on InitialiseSettings.php for a private wiki.
== August 25 ==
* 23:07 brion: enabled write API, let's see what happens!
* 22:41 brion: query.php disabled as scheduled.
* 22:07 brion: a SiteConfiguration code change broke upload dirs for a bit. reverted it.
* 20:15 brion: set wgNewUserSuppressRC to true; it was false, unsure why. It's annoying.
* 14:30 RobH: pushed dns changes to langlist to support cz. as well as a number of other langlist redirects not added to dns.
* 14:15 RobH: Fixed an error in my additions for the cz.wikistuff, pushed out the redirects to apaches.
* 12:10 domas: mark stealing db10 for ''stuff''
* 11:00 domas: reenabled db10, added db14 to s1, db9 given away to non-core tasks, added full contributions load to db16 (as it has covering index)
* 09:55 domas: reverted an instance where 'IndexPager' was causing filesorts... :)
* 08:00 domas: cleaned up hume / diskspace, was full, added /a to updatedb prunepaths, apt-get clean too - 4.5G released
* 08:00 domas: disabled db10 for db14 bootstrap
* 07:36 domas: updating FlaggedRevs schema on ruwiki.
* 02:26 brion: updating MW, including FlaggedRevs schema update (fp_pending_since, flaggedrevs_tracking)
== August 24 ==
* 17:15 domas: removing db9 entirely, crashed, disk gone...
* 07:20 Tim: deployed the TrustedXFF extension that I just wrote.
* 02:56 Tim: removed db9 from the contributions, watchlist and recentchangeslinked query groups. Long running queries (2000 seconds) from IndexPager::reallyDoQuery and ApiQueryContributions::execute, probably needs index fixes. Removed general load from the remaining query group server, db7.
== August 22 ==
* 21:34 RobH: [[will]] moved from A4 to A2.
* 21:00 RobH: [[diderot]] unracked
* 00:27 brion: FR feedback on enwikinews as well
* 00:24 brion: Deleting email record rows from cu_changes; some had slipped through before we disabled the privacy breakage
== August 21 ==
* 23:47 brion: FlaggedRevs feedback enabled on test & labs
* 23:35 brion: Enabled experimental HTML diff on,, and
* 18:17 RobH: Updated DNS entries to add a number of .cz domains.  Also updated redirects.conf to support the added domains.
* 11:43 Tim: installing GlobalBlocking
* 02:42 Tim: returned db16 to general load, a less critical role
* 02:30 Tim: installed mysql-client-5.0 on db11-16. Installed ganglia-metrics on thistle, db1, db4, db7, db12, db13, db14, db15, db16.
* 02:20 Tim: offloaded query group read load from db16. System+user CPU disappeared.
** Recovery spike in I/O shows that replication was suppressed due to read activity. Caught up in ~8 minutes.
* 02:11 Tim: db16 is chronically lagged, probably overloaded with inflexible query group load
** db16 shows high flat system+user CPU since ~01:05
== August 20 ==
* 04:15 Tim: attempting to upgrade hume from Ubuntu 7.10 to 8.04
* 01:24 brion: experimentally lifting $wgExportMaxLimit from 1000 to infinity on enwiki -- testing hack to SpecialExport.php to use unbuffered query
== August 19 ==
* 08:38 Tim: done with lomaria
* 07:42 Tim: taking lomaria out of rotation to drop non-s2a databases and change its replication to s2a-only.
* 04:45 Tim: increased load on db13 to relieve db8, stressed by removal of lomaria from s2
* 04:10 Tim: A hotlinking mirror, getting images from thumb.php, was being visited at high rate, DoSing our storage servers. Referer blocked.
* 03:50 Tim: ixia disk space critical, fixed
* 03:45 Tim: Older s3 slave servers are showing signs of strain. Adding more s3 load to db11 to test its capacity.
** db11 is fine at 47% load ratio, reporting 80-90% disk util, await 5-7ms, load ~6
** 96% load ratio, reporting disk util ~90%, await ~6ms, load ~7.5. Wait CPU ~12%. Yawning in mock-boredom.
* 03:37 Tim: lomaria was relatively overloaded. Adjusted loads, put it in an s2a role since we haven't had any s2a servers since holbach was decommissioned
* 02:40 Tim: removed holbach, webster and bacon from db.php, decommissioned. Removed decommissioned servers from $wgSquidServersNoPurge.
* 02:27 Tim: compiled [[UDP based profiling|udpprofile]] on zwinger, started collector. Firewalled port 3811 inbound, /etc/init.d/iptables save. Updated MediaWiki configuration. Updated on bart.
* 01:40 Tim: reduced apache "TimeOut" on srv38/39 from 300 to 10, to limit the impact of LVS flapping
== August 18 ==
* 23:00 RobH: added the image scaling servers back into the apache node group and updated their config files.  This fixes the thumbnail generation issue evident on both uploads. and se.wikimedia (may have existed elsewhere as well, in fact, it most certainly must have.)  All apaches restarted.
== August 17 ==
* 22:30 jeluf: restarted apaches on srv38/39 due to user reports about broken thumbnails.
== August 16 ==
* 13:20 mark: Reenabled ProxyFetch monitor on rendering cluster on [[lvs3]], and set <tt>depool_threshold = .5</tt>.
* 12:58 Tim: removed ProxyFetch monitor from rendering cluster in pybal on lvs3
* 12:50 Tim: thumbnailing broke completely, at ~03:00 UTC. The apache processes on srv38/39 were stuck waiting for writes to the storage servers. Couldn't find the associated PHP threads on the storage servers to see if something was holding them up, so I tried restarting apache on srv38/39 instead. Suspect broken connections due to regular depooling by pybal
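The <tt>depool_threshold = .5</tt> setting in the 13:20 entry is commonly read as "never depool so many servers that less than this fraction of the pool remains". The sketch below illustrates that reading only. It is an assumption about the semantics, not PyBal's actual code, and the function and server names are hypothetical.

```python
def apply_depool_threshold(servers, failed, depool_threshold=0.5):
    # Assumed rule: a failing server is depooled only if the number of
    # servers still pooled afterwards is at least
    # depool_threshold * (total number of servers).
    total = len(servers)
    pooled = list(servers)
    kept_despite_failure = []
    for name in servers:
        if name not in failed:
            continue
        if len(pooled) - 1 >= depool_threshold * total:
            pooled.remove(name)
        else:
            # Threshold reached: leave the server in, even though its
            # health check is failing.
            kept_despite_failure.append(name)
    return pooled, kept_despite_failure
```

Under this reading, a monitor that fails on every server at once (for example because of a backend problem rather than a dead apache) can take out at most half of a pool when the threshold is 0.5.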
== August 14 ==
* 18:55 domas: fixed db16 replication
* 18:50 brion: [[db16]] replication is broken -- contribs/watchlists/recentchangeslinked for enwiki stopped at about 4 hours ago
* ??? ??? db16 crashed
== August 13 ==
* 17:10 Tim: Changed to use a PHP script to highlight the source files from NFS on request, instead of them being updated periodically. Added a warning header to all affected files.
* 06:17 Tim: Removed old ExtensionDistributor snapshots (find -mtime +1 -exec rm {} \;), synced [[rev:39273|r39273]]
* 02:40 brion: fixed permissions on dewiki thumb dir -- root-owned directory not writable by apache worked for existing directories, but failed for the 'archive' directory needed for old-version thumbnails used by FlaggedRevs
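The snapshot cleanup in the 06:17 entry (<tt>find -mtime +1 -exec rm {} \;</tt>) can be sketched in Python as follows. This is an illustration of the command's age test, not the script that was run; <tt>remove_older_than</tt> is a hypothetical name, and unlike the bare <tt>find</tt> it only considers non-directory entries.

```python
import os
import time


def remove_older_than(root, days=1, now=None):
    # find's "-mtime +N" counts age in whole 24-hour periods (fraction
    # discarded) and matches when that count is greater than N, so
    # "+1" means "modified at least two full days ago".
    now = time.time() if now is None else now
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            age_days = int((now - os.lstat(path).st_mtime) // 86400)
            if age_days > days:
                os.remove(path)
                removed.append(path)
    return removed
```

Passing <tt>now</tt> explicitly makes the cutoff reproducible, which is convenient for a dry run against a copy of the directory before deleting anything for real.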
== August 12 ==
* 21:06 mark: Moved LVS load balancing of apaches to [[lvs3]] as well, using a new service IP (<tt></tt>)
* 18:10 brion: fixed up security config that disabled PHP execution in extension directories; several configs had this wrong and non-functional
* 12:45 tfinc: removed /srv/  & /srv/org.wikimedia.donate on [[srv9]] and removed the apache confs that mention them.
== August 11 ==
* 23:53 mark: Moved traffic from Russia (iso code 643) to knams
* 23:53 mark: Moved the rendering cluster LVS to [[lvs3]] as well.
* 22:45 mark: Deployed [[lvs3]] as the first new internal [[LVS]] cluster host, and moved over the search pools to it using ''new service IPs'' (outside the subnet). The rest of the LVS cluster as well as the documentation are a work in progress - let me know if there are any problems.
== August 10 ==
* 17:43 Tim: freed up another 100GB or so by deleting all dumps from February 2008.
* 17:27 Tim: freed up a few GB on storage2 by deleting failed dumps: enwiki/{20080425,20080521,20080618,20080629}, dewiki/20080629.
== August 8 ==
* 22:46 RobH: setup network access LOM for [[db13]], [[db14]], [[db15]], & [[db16]]
* 22:40 brion: set up 'inactive' group on private wikis; this is just "for show" to indicate disabled accounts, adding a user to the group doesn't actually disable them :)
* 21:15 brion: can't seem to reach the 'oai' audit database on adler from the wiki command-line scripts. This is rather annoying; permissions wrong maybe?
== August 6 ==
* 17:25 brion: updated [ dump index page] to indicate dumps are halted atm
== August 5 ==
* 22:09 mark: Shutdown BGP session to XO for maintenance
* 18:27 RobH: [[db14]], [[db15]], [[db16]] installed with Ubuntu.
* 18:24 brion: enabling flaggedrevs on ruwiki per []
* 17:09 brion: enabling flaggedrevs on enwikinews per []
* 06:20 jeluf: set wgEnotifUserTalk to true on all but the top wikis, see [ bugzilla]
== August 4 ==
* 05:58 brion: dewiki homepage broken for a few minutes due to a bogus i18n update in imagemap breaking the 'desc' alignment options
== August 3 ==
* 14:15 robert: got reports about lots of failed searches on nl and, looks like diderot (again) failed to depool a dead server (rabanus), removed manually.
== August 1 ==
* 21:05 brion: forcing display_errors on for CLI so I don't keep discovering my command-line scripts are broken _after_ I run them, they don't show any errors, and I thought they worked. :)
* 06:39 Tim: wrote a PHP syntax check for scap, using parsekit, that runs about 6 times faster than the old one
* 04:58 Tim: installing PHP on suda (CLI only) for syntax check speed test
* 01:46 Tim: removed db1 from rotation, it's stopped in gdb at a segfault.
* 00:22 brion: aha! found the problem. MaxClients was turned down to 10 from default of 150 long ago, while the old prefix search was being tested. :) now back to 150
* 00:19 brion: just turning off the mobile gateway on yongle for now, it just doesn't appear to be working at full load. (files moved to subdir -- in /x/ it works fine seemingly). Server doesn't appear overly loaded -- CPU and load are low -- just the requests stick.
* 00:10 brion: installing APC on [[yongle]], php bits are ungodly slow sometimes

Latest revision as of 04:00, 15 May 2022