XFS and our servers
This page describes how we format our XFS partitions and why.
How they're formatted
root@db1047:/a/sqldata# xfs_info /dev/sda6
meta-data=/dev/sda6              isize=256    agcount=4, agsize=109108672 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=436434688, imaxpct=5
         =                       sunit=64     swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=213120, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
How they get formatted
fenari:/home/midom/xfsfix is a Python script that looks up the right device and UUID names, then prints executable bash code that looks something like this:
root@db1047:/a/sqldata# python /root/xfsfix
umount /dev/sda6
mkfs.xfs -f -d sunit=512,swidth=4096 -L data /dev/sda6
xfs_admin -U f1363f7d-8a44-4abe-9e38-bf2171e265c8 /dev/sda6
mount /dev/sda6
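The xfsfix source is not reproduced on this page, so the following is only a minimal sketch of how a script could emit the four commands shown above. The function name xfs_reformat_commands and its parameters are hypothetical, and the device and UUID are passed in by hand here rather than detected as the real script does.

```python
import shlex


def xfs_reformat_commands(device, uuid, sunit=512, swidth=4096, label="data"):
    """Return the shell commands, in order, that reformat one XFS device.

    Hypothetical helper: the defaults mirror the sample output on this page,
    not necessarily what xfsfix itself computes.
    """
    dev = shlex.quote(device)
    return [
        f"umount {dev}",
        # -f forces mkfs over the existing filesystem; sunit/swidth set stripe geometry
        f"mkfs.xfs -f -d sunit={sunit},swidth={swidth} -L {shlex.quote(label)} {dev}",
        # restore the previous UUID so anything that refers to it keeps working
        f"xfs_admin -U {shlex.quote(uuid)} {dev}",
        f"mount {dev}",
    ]


if __name__ == "__main__":
    for line in xfs_reformat_commands(
        "/dev/sda6", "f1363f7d-8a44-4abe-9e38-bf2171e265c8"
    ):
        print(line)
```

Printing the commands instead of running them, as the sample output suggests xfsfix does, leaves a chance to review them before anything destructive happens.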
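You'll notice that the sunit and swidth numbers put out by the script (512 and 4096) don't match what xfs_info prints out (64 and 512). The two tools use different units: mkfs.xfs takes sunit and swidth in 512-byte sectors, while xfs_info reports them in filesystem blocks (4096 bytes here), so both pairs describe a 256 KiB stripe unit and a 2 MiB stripe width. In the conversation below, Domas guesses that the script's numbers are too large, but says the values reported by xfs_info are acceptable as they are.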
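The arithmetic relating the two sets of numbers can be checked with a short sketch. It assumes that mkfs.xfs takes sunit and swidth in 512-byte sectors (as Domas says in the conversation below) and that xfs_info reports them in filesystem blocks of bsize=4096 bytes (the "blks" suffix in its output). The helper names are made up for illustration.

```python
SECTOR_BYTES = 512  # unit assumed for mkfs.xfs -d sunit= / swidth=


def sectors_to_blocks(sectors, block_bytes=4096):
    """Convert a count of 512-byte sectors to filesystem blocks."""
    total = sectors * SECTOR_BYTES
    if total % block_bytes:
        raise ValueError("value is not a whole number of filesystem blocks")
    return total // block_bytes


def sectors_to_kib(sectors):
    """Convert a count of 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES // 1024


if __name__ == "__main__":
    for name, sectors in (("sunit", 512), ("swidth", 4096)):
        print(name, sectors, "sectors =",
              sectors_to_blocks(sectors), "blocks =",
              sectors_to_kib(sectors), "KiB")
```

Under those assumptions, sunit=512 and swidth=4096 on the mkfs.xfs command line correspond to the sunit=64 and swidth=512 blocks that xfs_info prints.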
Why they're formatted this way
[11:55 AM] <domas> 70U
[11:56 AM] <maplebed> so domas any thoughts on that pastebin?
[11:57 AM] <domas> hmmmm
[11:57 AM] <domas> *shrug*
[11:57 AM] • domas looks some more
[11:57 AM] <maplebed> Jeff_Green notices that the sunit=64 and swidth=512 is also present on db26
[11:57 AM] <domas> not on other machines?
[11:57 AM] <maplebed> (in the quest to see what's "right" that seems like a good place to start)
[11:57 AM] <domas> db26 is LVM
[11:57 AM] <maplebed> I haven't looked at other machines yet.
[11:58 AM] <maplebed> db42 is the same...
[11:59 AM] <domas> I guess I just have too high numbers there
[11:59 AM] <domas> it is not in bytes but in 512b sectors
[12:00 PM] <maplebed> not blocks? (which are set to 4096)?
[12:01 PM] <domas> pain oh pain
[12:01 PM] <Jeff_Green> ha
[12:01 PM] <domas> 'sectors' is usually in 512 in linux
[12:02 PM] <domas> 512*512 is 256k alignment
[12:02 PM] <maplebed> at any rate, I've got to run; if you think the current settings are fine I'll update the docs.
[12:02 PM] <domas> they are good enough
[12:02 PM] <domas> 32k alignment is good enough too
[12:02 PM] <domas> the major thing is not to have 16k partitioned
[12:02 PM] <domas> meh, we're talking 10% perf here
[12:03 PM] <domas> and we're not overloading i/o anyway
[12:03 PM] <Jeff_Green> domas: could you email/wiki/something us some notes on your tweaks?
[12:03 PM] <domas> jeff_green: there're not that many!
[12:03 PM] <domas> but I can try!
[12:03 PM] <Jeff_Green> i saw we tweak raid-related stripe stuff only?
[12:04 PM] <Jeff_Green> at CL we ended up tweaking only agcount (to 32) and the usual mount options, curious what/why you tweak
[12:04 PM] <domas> jeff_green a/g doesn't matter much, we have just one file that is big enough =)
[12:05 PM] <domas> jeff_green: stripe alignment is to avoid multiple reads for one block
[12:08 PM] <Jeff_Green> how does that interact with striping in hardware RAID?
[12:10 PM] <domas> well
[12:10 PM] <domas> if you don't align files on stripe boundaries
[12:10 PM] <domas> if a file is made out of 16k blocks
[12:10 PM] <domas> and you have 64k stripe
[12:10 PM] <domas> and it is not aligned
[12:11 PM] <domas> so, 25% of blocks will need two I/Os instead of one
[12:11 PM] <domas> because the block will reside on two disks
[12:11 PM] <domas> now, if you align them all on stripe boundary, all blocks are residing just on one disk
[12:11 PM] <domas> (I'm not counting mirrors)
[12:12 PM] <domas> back in the day it was much more painful, as we had to align partitions too
[12:12 PM] <domas> (which is what xfsfix was mostly for)
[12:12 PM] <domas> we were editing partitiontable with xfsfix before
[12:12 PM] <Jeff_Green> ok. I'm going to apply this to db1040 as a comprehension exercise
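The 25% figure from the conversation can be reproduced with a small sketch. It lays out consecutive 16k blocks on a 64k stripe unit and counts how many cross a stripe boundary, and would therefore need I/O on two disks. The function name and the 4k misalignment offset are assumptions chosen for illustration; any offset that is not a multiple of 16k gives the same fraction.

```python
def straddling_fraction(block_bytes, stripe_bytes, start_offset, n_blocks=1024):
    """Fraction of consecutive blocks that span two stripe units.

    start_offset is where the first block begins, in bytes, relative to a
    stripe boundary (0 means the file is stripe-aligned).
    """
    crossing = 0
    for i in range(n_blocks):
        first_byte = start_offset + i * block_bytes
        last_byte = first_byte + block_bytes - 1
        # A block straddles when its first and last byte land in different stripe units.
        if first_byte // stripe_bytes != last_byte // stripe_bytes:
            crossing += 1
    return crossing / n_blocks


if __name__ == "__main__":
    KIB = 1024
    print("misaligned by 4k:", straddling_fraction(16 * KIB, 64 * KIB, 4 * KIB))
    print("aligned:         ", straddling_fraction(16 * KIB, 64 * KIB, 0))
```

With these inputs one block in every four crosses a boundary when the file is misaligned, and none do when it starts on a stripe boundary, which matches the reasoning given in the conversation.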