NameNodes
New NameNode Installation
Our Hadoop NameNodes are R420 boxes. It was difficult (if not impossible) to create a suitable partman recipe for the partition layout we wanted. The namenode partitions were created manually during installation.
These nodes have 4 disks. We are mostly concerned with reliability of these nodes. The 4 disks were assembled into a single software RAID 1 array:
$ mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Jan  7 21:10:18 2015
     Raid Level : raid1
     Array Size : 2930132800 (2794.39 GiB 3000.46 GB)
  Used Dev Size : 2930132800 (2794.39 GiB 3000.46 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
    Update Time : Wed Jan 14 21:42:20 2015
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
           Name : analytics1001:0  (local to host analytics1001)
           UUID : d1e971cb:3e615ace:8089f89e:280cdbb3
         Events : 100
    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
       2       8       34        2      active sync   /dev/sdc2
       3       8       50        3      active sync   /dev/sdd2
An LVM volume group was then created on top of md0. Two logical volumes were added: one for the / (root) filesystem and one for /var/lib/hadoop/name, the Hadoop NameNode partition.
$ cat /etc/fstab
/dev/mapper/analytics--vg-root      /                     ext4  errors=remount-ro  0  1
/dev/mapper/analytics--vg-namenode  /var/lib/hadoop/name  ext3  noatime            0  2
High Availability
We use automatic failover between active and standby NameNodes. This means that upon start, each NameNode process interacts with the hadoop-hdfs-zkfc daemon, which uses ZooKeeper to establish the active and standby nodes automatically. Restarting the active NameNode is handled gracefully, promoting the standby to active.
Note that you need to use the logical name of the NameNode in hdfs haadmin commands, not the hostname. The puppet-cdh module uses the fqdn of the node with dots replaced with dashes as the logical NameNode names.
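For example, a hedged way to derive the logical name from a host's fqdn and query its state (the hostname here is illustrative):
fqdn=analytics1001.eqiad.wmnet
logical_name=${fqdn//./-}            # -> analytics1001-eqiad-wmnet
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState $logical_name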
ResourceManager
HA and automatic failover are set up for the YARN ResourceManager too (handled by Zookeeper). hadoop-yarn-resourcemanager runs on both master hosts, with analytics1001 as the primary active host. You may stop this service on either node, as long as at least one is running.
From analytics1002:
sudo service hadoop-yarn-resourcemanager restart
Manual Failover
To identify the current NameNode machines, read the puppet configs (analytics100[12] at the moment of writing) and run:
elukey@analytics1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1001-eqiad-wmnet
active
elukey@analytics1001:~$ sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1002-eqiad-wmnet
standby
elukey@analytics1001:~$ sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState analytics1001-eqiad-wmnet
active
elukey@analytics1001:~$ sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState analytics1002-eqiad-wmnet
standby
If you want to move the active status to a different NameNode, you can force a manual failover:
sudo -u hdfs /usr/bin/hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet
(That command assumes that analytics1001-eqiad-wmnet is the currently active and should become standby, while analytics1002-eqiad-wmnet is standby and should become active)
The YARN ResourceManager and HDFS NameNode are configured to automatically fail over between the two master hosts, so you don't need to fail them over manually; you can just stop the hadoop-yarn-resourcemanager and hadoop-hdfs-namenode services. After you finish your maintenance, check the service states again to ensure that the active ResourceManager is analytics1001. The HDFS NameNode can be switched back using the failover command described above, no need for unnecessary restarts.
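A sketch of the usual maintenance flow on one of the master hosts (service and logical names as used elsewhere on this page; adjust to whichever host you are working on):
# Stop the daemons on the host under maintenance; the other master takes over automatically.
sudo service hadoop-yarn-resourcemanager stop
sudo service hadoop-hdfs-namenode stop
# ... do the maintenance ...
sudo service hadoop-hdfs-namenode start
sudo service hadoop-yarn-resourcemanager start
# Verify which host ended up active.
sudo -u hdfs /usr/bin/yarn rmadmin -getServiceState analytics1001-eqiad-wmnet
sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1001-eqiad-wmnet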
Migrating to new HA NameNodes
This section will describe how to make an existent cluster use new hardware for new NameNodes. This will require a full cluster restart.
Warning: This section might be out of date since we enabled automatic failover of the master HDFS NameNode via Zookeeper. Please double check this procedure and update it if needed!
Put new NameNode(s) into 'unknown standby' mode
HDFS HA does not allow for hdfs-site.xml to specify more than two dfs.ha.namenodes at a time. HA is intended to only work with a single standby NameNode.
An unknown standby NameNode (I just made this term up!) is a standby NameNode that knows about the active master and the JournalNodes, but that is not known by the rest of the cluster. That is, it will be configured to know how to read edits from the JournalNodes, but it will not be a fully functioning standby NameNode. You will not be able to promote it to active. Configuring your new NameNodes as unknown standbys allows them to sync their name data from the JournalNodes before shutting down the cluster and configuring them as the new official NameNodes.
Configure the new NameNodes exactly as you would have a normal NameNode, but make sure that the following properties only set the current active NameNode and the hostname of the new NameNode. In this example, let's say you have NameNodes nn1 and nn2 currently operating, and you want to migrate them to nodes nn3 and nn4.
nn3 should have the following set in hdfs-site.xml
<property>
<name>dfs.ha.namenodes.cluster-name</name>
<value>nn1,nn3</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster-name.nn1</name>
<value>nn1:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.cluster-name.nn3</name>
<value>nn3:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster-name.nn1</name>
<value>nn1:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.cluster-name.nn3</name>
<value>nn3:50070</value>
</property>
Note that there is no mention of nn2 in hdfs-site.xml on nn3. nn4 should be configured the same, except with reference to nn4 instead of nn3.
Once this is done, you can start hadoop-hdfs-namenode on nn3 and nn4 and bootstrap them. This is done on both new nodes.
sudo -u hdfs hdfs namenode -bootstrapStandby
service hadoop-hdfs-namenode start
NOTE: If you are using the puppet-cdh module, this will be done for you. You should probably just conditionally configure your new NameNodes differently than the rest of your cluster and apply that for this step, e.g.:
namenode_hosts => $::hostname ? {
'nn3' => ['nn1', 'nn3'],
'nn4' => ['nn1', 'nn4'],
default => ['nn1', 'nn2'],
}
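Before moving on, it's worth a quick sanity check that the new NameNodes came up and are serving their web UI (a sketch; 50070 is the http-address port configured above, and nn3/nn4 are the example hostnames from this section):
# On nn3 and nn4, or from any host that can reach them:
curl -sI http://nn3:50070/ | head -n 1
curl -sI http://nn4:50070/ | head -n 1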
Put all NameNodes in standby and shutdown Hadoop.
Once your new unknown standby NameNodes are up and bootstrapped, transition your active NameNode to standby so that all writes to HDFS stop. At this point all 4 NameNodes will be in sync. Then shutdown the whole cluster:
sudo -u hdfs hdfs haadmin -transitionToStandby nn1
# Do the following on every Hadoop node.
# Since you will be changing global configs, you should also
# shutdown any 3rd party services too (e.g. Hive, Oozie, etc.).
# Anything that has a reference to the old NameNodes should
# be shut down.
shutdown_service() {
test -f /etc/init.d/$1 && echo "Stopping $1" && service $1 stop
}
shutdown_hadoop() {
shutdown_service hue
shutdown_service oozie
shutdown_service hive-server2
shutdown_service hive-metastore
shutdown_service hadoop-yarn-resourcemanager
shutdown_service hadoop-hdfs-namenode
shutdown_service hadoop-hdfs-httpfs
shutdown_service hadoop-mapreduce-historyserver
shutdown_service hadoop-hdfs-journalnode
shutdown_service hadoop-yarn-nodemanager
shutdown_service hadoop-hdfs-datanode
}
shutdown_hadoop
Reconfigure your cluster with the new NameNodes and start everything back up.
Next, edit hdfs-site.xml everywhere, and anywhere else that there is a mention of nn1 or nn2, and replace them with nn3 and nn4. If you are moving YARN services as well, now is a good time to reconfigure them with the new hostnames too.
Restart your JournalNodes with the new configs first. Once that is done, you can start all of your cluster services back up in any order. Once everything is back up, transition your new primary active NameNode to active:
sudo -u hdfs hdfs haadmin -transitionToActive nn3
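You can verify the result with the same getServiceState command used elsewhere on this page (nn3 and nn4 being the example logical names from this section):
sudo -u hdfs hdfs haadmin -getServiceState nn3
sudo -u hdfs hdfs haadmin -getServiceState nn4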
Name Node Administration UI
See Analytics/Cluster/Access for instructions on setting up a SOCKS proxy to connect to these internally-hosted web interfaces.
- http://analytics1001.eqiad.wmnet:8088/ - Hadoop YARN Job Browser (also available to all WMF employees/contractors via yarn.wikimedia.org)
- http://analytics1001.eqiad.wmnet:50070/ - Hadoop Distributed File System NameNode Status
JournalNodes
JournalNodes should be provisioned in odd numbers 3 or greater. These should be balanced across rows for greater resiliency against potential datacenter related failures.
Adding a new JournalNode to a running HA Hadoop Cluster
See https://groups.google.com/a/cloudera.org/d/msg/cdh-user/M4OvE-oimOk/dyGfSoI_3TgJ
1. Create journal node LVM volume and partition
See the section below about new worker node installation.
2. Copy the journal directory from an existing JournalNode.
# shut down one of your existing JournalNodes
service hadoop-hdfs-journalnode stop
# copy the journal data over
rsync -avP /var/lib/hadoop/journal/ newhost.eqiad.wmnet:/var/lib/hadoop/journal/
# Start the JournalNode back up
service hadoop-hdfs-journalnode start
3. Puppetize and start the new JournalNode. Now puppetize the new node as a JournalNode. In role/analytics/hadoop.pp edit the role::analytics::hadoop::production class and add this node's fqdn to the $journalnode_hosts array.
$journalnode_hosts = [
    'analytics1011.eqiad.wmnet',  # Row A2
    'analytics1014.eqiad.wmnet',  # Row C7
    'oldnode.eqiad.wmnet',
    'newnode.eqiad.wmnet',        # Brand new row!
]
Run puppet on the new JournalNode. This should install and start the JournalNode daemon.
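A quick sanity check on the new host that the daemon is up and listening (a sketch; 8485 is the default JournalNode RPC port, adjust if the cluster overrides it):
service hadoop-hdfs-journalnode status
netstat -lnpt | grep 8485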
4. Apply puppet on NameNodes and restart them.
Once the new JournalNode is up and running, we need to reconfigure and restart the NameNodes to get them to recognize the new JournalNode.
Run puppet on each of your NameNodes. The new JournalNode will be added to the list in dfs.namenode.shared.edits.dir.
Restart NameNode on your standby NameNodes first. Once that's done, check their dfshealth status to see if the new JournalNode registers. Once all of your standby NameNodes see the new JournalNode, go ahead and promote one of them to active so we can restart the primary active NameNode:
sudo -u hdfs /usr/bin/hdfs haadmin -failover analytics1001-eqiad-wmnet analytics1002-eqiad-wmnet
Note that you need to use the logical name of the NameNode here, not the hostname. The puppet-cdh module uses the fqdn of the node with dots replaced with dashes as the logical NameNode names.
Once you've failed over to a different active NameNode, you may restart hadoop-hdfs-namenode on your usual active NameNode. Check the dfshealth status page again and be sure that the new JournalNode is up and writing transactions.
Cool! You'll probably want to promote your usual NameNode back to active:
sudo -u hdfs /usr/bin/hdfs haadmin -failover analytics1002-eqiad-wmnet analytics1001-eqiad-wmnet
Done!
Logs
There are many pieces involved in how logs are managed in Hadoop; the best description we found is here: http://hortonworks.com/blog/simplifying-user-logs-management-and-access-in-yarn/
The important things:
- YARN aggregate job log retention is specified in yarn-site.xml (see the check sketched after this list).
- If the YARN aggregate job log retention changes, the JobHistoryServer needs to be restarted (that is hadoop-mapreduce-historyserver in our setup).
- If you're looking for logs, run:
yarn logs -applicationId <appId>
This will not work until your job is finished or dies. Also, if you're not the app owner, run sudo -u hdfs yarn logs -applicationId <appId> --appOwner APP_OWNER
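A quick way to see the currently configured retention is to grep the property in yarn-site.xml (a sketch; yarn.log-aggregation.retain-seconds is the upstream YARN property name, and /etc/hadoop/conf is the usual CDH config path):
grep -A 1 'yarn.log-aggregation.retain-seconds' /etc/hadoop/conf/yarn-site.xml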
Worker Nodes (DataNode & NodeManager)
New Worker Installation (12 disk, 2 flex bay drives - analytics1028-analytics1077)
These nodes come with 2 x 2.5" drives on which the OS and JournalNode partitions are installed. This leaves all of the space on the 12 4TB HDDs for DataNode use.
Device | Size | Mount Point |
---|---|---|
/dev/mapper/analytics1029--vg-root (LVM on /dev/sda5 Hardware RAID 1 on 2 flex bays) | 30 G | / |
/dev/mapper/analytics1029--vg-journalnode (LVM on /dev/sda5 Hardware RAID 1 on 2 flex bays) | 10G | /var/lib/hadoop/journal |
/dev/sda1 | 1G | /boot |
/dev/sdb1 | Fill Disk | /var/lib/hadoop/data/b |
/dev/sdc1 | Fill Disk | /var/lib/hadoop/data/c |
/dev/sdd1 | Fill Disk | /var/lib/hadoop/data/d |
/dev/sde1 | Fill Disk | /var/lib/hadoop/data/e |
/dev/sdf1 | Fill Disk | /var/lib/hadoop/data/f |
/dev/sdg1 | Fill Disk | /var/lib/hadoop/data/g |
/dev/sdh1 | Fill Disk | /var/lib/hadoop/data/h |
/dev/sdi1 | Fill Disk | /var/lib/hadoop/data/i |
/dev/sdj1 | Fill Disk | /var/lib/hadoop/data/j |
/dev/sdk1 | Fill Disk | /var/lib/hadoop/data/k |
/dev/sdl1 | Fill Disk | /var/lib/hadoop/data/l |
/dev/sdm1 | Fill Disk | /var/lib/hadoop/data/m |
The analytics-flex.cfg partman recipe will have created the root and swap logical volumes. We need to create the JournalNode and DataNode partitions manually. Copy/Pasting the following commands should do this for you:
You'll need parted installed, so first do
apt-get install parted
We want a journal node lvm partition on every node so we can move the journal daemons around if needed:
#!/bin/bash
set -e
set -x
# Create a logical volume for JournalNode data.
# There should only be one VG, look up its name:
vgname=$(vgdisplay -C --noheadings -o vg_name | head -n 1 | tr -d ' ')
lvcreate -n journalnode -L 10G $vgname
# make an ext4 filesystem
mkfs.ext4 /dev/$vgname/journalnode
# Don't reserve any blocks for OS on this partition.
tune2fs -m 0 /dev/$vgname/journalnode
mount_point=/var/lib/hadoop/journal
mkdir -pv $mount_point
grep -q $mount_point /etc/fstab || echo -e "# Hadoop JournalNode partition\n/dev/$vgname/journalnode\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab
mount -v $mount_point
IMPORTANT: DO NOT PROCEED FURTHER IF YOU WANT TO PRESERVE THE DATANODE PARTITIONS; PLEASE CHECK THE NEXT SECTION! Now make ext4 filesystems on each DataNode partition. These disks are larger than 2TB, so we must use parted to create a GUID partition table (GPT) on them:
#!/bin/bash
set -e
set -x
for disk_letter in b c d e f g h i j k l m; do
disk=/dev/sd${disk_letter}
parted ${disk} --script mklabel gpt
parted ${disk} --script mkpart primary ext4 0% 100%
partition=${disk}1
mkfs.ext4 $partition &
done
IMPORTANT: wait for all of the above ext4 filesystems to finish formatting before running the following loop. Then make the tune2fs adjustments and mount all the new partitions:
#!/bin/bash
set -e
set -x
data_directory=/var/lib/hadoop/data
for disk_letter in b c d e f g h i j k l m; do
partition_number=1
partition="/dev/sd${disk_letter}${partition_number}"
mount_point="${data_directory}/${disk_letter}"
# Don't reserve any blocks for OS on these partitions.
tune2fs -m 0 $partition
# Make the mount point.
mkdir -pv $mount_point
# add it to fstab unless it is already there
grep -q $mount_point /etc/fstab || (
uuid=$(blkid | grep primary | grep ${partition} | awk '{print $2}' | sed -e 's/[:"]//g')
echo -e "# Hadoop DataNode partition ${disk_letter}\n${uuid}\t${mount_point}\text4\tdefaults,noatime\t0\t2" | tee -a /etc/fstab
)
mount -v $mount_point
done
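A quick way to confirm that everything is mounted where the table above expects it (a sketch):
df -h /var/lib/hadoop/journal /var/lib/hadoop/data/*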
Check that all the megacli virtual disks have the following config:
$ megacli -LDPDInfo -aAll | grep "Current Cache Policy" | sort| uniq -c
13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
$ megacli -AdpBbuCmd -a0 | grep Auto-Learn
Auto-Learn Mode: Disabled
In case they don't:
# ReadAhead Adaptive
megacli -LDSetProp ADRA -LALL -aALL
# Direct (No cache)
megacli -LDSetProp -Direct -LALL -aALL
# No write cache if bad BBU
megacli -LDSetProp NoCachedBadBBU -LALL -aALL
# Disable BBU auto-learn
echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0
Worker Reimage (12 disk, 2 flex bay drives - analytics1028-analytics1057)
In this case only the Operating system is reinstalled, so there is no need to wipe the HDFS datanode partitions (/dev/sdX1). You can follow the same procedure described in the above section for the root partition, and then proceed with the following one for the datanode partitions.
Before starting, disable puppet! This should prevent any Hadoop daemon from starting before the datanode partitions are in a good state.
JournalNode reimage, IMPORTANT! The JournalNode daemon needs to have its files under the /var/lib/hadoop/journal directory, otherwise it will start logging errors stating that it is not formatted correctly. One trick is to save the journal directory before the reimage in one of the datanode dirs (which will not be formatted) and then restore it later on, as sketched below. Otherwise, simply copy the journal directory from another host.
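A minimal sketch of the stash-and-restore trick described above (assuming /var/lib/hadoop/data/b survives the reimage; the backup path is illustrative):
# Before the reimage: stop the daemon and stash the journal data on a DataNode partition.
service hadoop-hdfs-journalnode stop
cp -a /var/lib/hadoop/journal /var/lib/hadoop/data/b/journal-backup
# After the reimage, once /var/lib/hadoop/data/b is mounted again:
cp -a /var/lib/hadoop/data/b/journal-backup /var/lib/hadoop/journal
chown -R hdfs:hdfs /var/lib/hadoop/journal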
With the analytics-flex.cfg partman recipe the partitions might have had their type changed, so a quick check is always good:
elukey@analytics1041:~$ sudo fdisk -l
Disk /dev/sdb: 3.7 TiB, 4000225165312 bytes, 7812939776 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: BFB77880-A74E-4E40-B898-15F4EEA4F39A
Device Start End Sectors Size Type
/dev/sdb1 2048 7812937727 7812935680 3.7T Linux filesystem
Disk /dev/sdc: 3.7 TiB, 4000225165312 bytes, 7812939776 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 25C834AE-3BFB-4B75-AECD-BC65DBD1F66F
[..]
If you don't see "Linux filesystem" as Type, please check that the current partition is ext4 with:
elukey@analytics1041:~$ sudo blkid /dev/sdb1
/dev/sdb1: UUID="632ee189-8df6-4e89-b589-0c6d60192521" TYPE="ext4" PARTLABEL="primary" PARTUUID="a462801a-ffb2-4e8a-82c5-e6c7a999830a"
If the filesystem is indeed ext4 but the partition type is wrong, you'll need to change the partition type manually with a tool like gdisk:
root@analytics1041:/home/elukey# gdisk /dev/sdb
GPT fdisk (gdisk) version 0.8.10
Partition table scan:
MBR: protective
BSD: not present
APM: not present
GPT: present
Found valid GPT with protective MBR; using GPT.
Command (? for help): t
Using 1
Current type is 'Microsoft Data'
Hex code or GUID (L to show codes, Enter = 8300):
Changed type of partition to 'Linux filesystem'
Command (? for help): w
Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!
Do you want to proceed? (Y/N): y
One command to bulk change all the disks:
for el in b c d e f g h i j k l m; do echo -e "t\n\nw\ny\n" | gdisk /dev/sd${el}; done
Then you'll need to add the partitions to the fstab with UUIDs. The following script will emit the chunk of text that you need to add to /etc/fstab:
sudo blkid | grep primary | awk '{print $2" "$1}' | sed -e 's/[:"]//g' | while read uuid partition;
do
letter=$(echo $partition| awk -F 'sd|1' '{print $2}');
echo -e "# Hadoop datanode $letter partition\n$uuid\t/var/lib/hadoop/data/${letter}\text4\tdefaults,noatime\t0\t2";
done
Create the datanode partition mount point directories manually (they are gone after the reinstall):
for el in b c d e f g h i j k l m; do mkdir -p /var/lib/hadoop/data/${el}; done
Then execute mount -a and you'll be able to see the datanode partitions mounted!
Last but not least, you'll need to make sure that the correct uid/gid are set on the datanode partitions' inodes (since we reinstalled the OS):
for letter in $(ls /var/lib/hadoop/data); do
sudo chown -Rv yarn:yarn /var/lib/hadoop/data/${letter}/yarn &
sudo chown -Rv hdfs:hdfs /var/lib/hadoop/data/${letter}/hdfs &
done
Final step, re-enable puppet and execute it!
Check that all the megacli virtual disks have the following config:
$ megacli -LDPDInfo -aAll | grep "Current Cache Policy" | sort| uniq -c
13 Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
$ megacli -AdpBbuCmd -a0 | grep Auto-Learn
Auto-Learn Mode: Disabled
In case they don't:
# ReadAhead Adaptive
megacli -LDSetProp ADRA -LALL -aALL
# Direct (No cache)
megacli -LDSetProp -Direct -LALL -aALL
# No write cache if bad BBU
megacli -LDSetProp NoCachedBadBBU -LALL -aALL
# Disable BBU auto-learn
echo "autoLearnMode=1" > /tmp/disable_learn && megacli -AdpBbuCmd -SetBbuProperties -f /tmp/disable_learn -a0
Decommissioning
To decommission a Hadoop worker node, you will need to edit hosts.exclude on all of the NameNodes, and then tell both HDFS and YARN to refresh the list of nodes.
Edit /etc/hadoop/conf/hosts.exclude on all NameNodes (analytics1001 and analytics1002 as of 2015-07) and add the FQDN of each node you intend to decommission, one hostname per line. NOTE: YARN and HDFS both use this file to exclude nodes. It seems that HDFS expects hostnames, and YARN expects FQDNs. To be safe, you can put both in this file, one per line. E.g. if you wanted to decommission analytics1012:
analytics1012.eqiad.wmnet
analytics1012
Once done, run the hdfs dfsadmin -refreshNodes command for each NameNode FS URI:
sudo -u hdfs hdfs dfsadmin -fs hdfs://analytics1010.eqiad.wmnet:8020 -refreshNodes
sudo -u hdfs hdfs dfsadmin -fs hdfs://analytics1009.eqiad.wmnet:8020 -refreshNodes
Run this on each ResourceManager host:
sudo -u hdfs yarn rmadmin -refreshNodes
Now check both the YARN and HDFS Web UIs to double check that your node is listed as decommissioning for HDFS, and not listed in the list of active nodes for YARN (a CLI alternative is sketched after this list):
- http://analytics1001.eqiad.wmnet:50070/dfsnodelist.jsp?whatNodes=DECOMMISSIONING
- http://yarn.wikimedia.org/cluster/nodes
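If you prefer the command line, the decommission state of each datanode also shows up in the dfsadmin report; a hedged sketch (field names as printed by hdfs dfsadmin -report, adjust the grep if your Hadoop version formats the report differently):
sudo -u hdfs hdfs dfsadmin -report | grep -E '^(Name|Decommission Status)'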
Balancing HDFS
Over time it might occur that some datanodes have almost all of their HDFS disk space filled up, while others have lots of free HDFS disk space. hdfs balancer can be used to re-balance HDFS.
In March 2015 ~0.8TB/day needed to be moved to keep HDFS balanced. (On 2015-03-13T04:40, a balancing run ended and HDFS was balanced. And on 2015-03-15T09:10 already 1.83TB needed to get balanced again)
hdfs balancer is now run regularly as a cron job, so hopefully you won't have to do this manually.
Checking balanced-ness on HDFS
To see how balanced/unbalanced HDFS is, you can run sudo -u hdfs hdfs dfsadmin -report on stat1007. This command outputs DFS Used% per data node. If that number is equal for each data node, disk space utilization for HDFS is proportionally alike across the whole cluster. If the DFS Used% numbers do not align, the cluster is somewhat unbalanced. By default, hdfs balancer considers a cluster balanced once each datanode is within 10 percentage points (in terms of DFS Used%) of the total DFS Used%.
Here's a pipeline that'll format a report for you:
qchris@stat1007:~$ sudo -u hdfs hdfs dfsadmin -report | cat <(echo "Name: Total") - | grep '^\(Name\|Total\|DFS Used\)' | tr '\n' '\t' | sed -e 's/\(Name\)/\n\1/g' | sort --field-separator=: --key=5,5n
Name: Total                                          DFS Used: 479802829437520 (436.38 TB)  DFS Used%: 56.48%
Name: 10.64.53.16:50010 (analytics1037.eqiad.wmnet)  DFS Used: 24728499759105 (22.49 TB)    DFS Used%: 52.34%
Name: 10.64.53.15:50010 (analytics1036.eqiad.wmnet)  DFS Used: 24795956362795 (22.55 TB)    DFS Used%: 52.48%
Name: 10.64.36.130:50010 (analytics1030.eqiad.wmnet) DFS Used: 24824351783406 (22.58 TB)    DFS Used%: 52.54%
Name: 10.64.53.20:50010 (analytics1041.eqiad.wmnet)  DFS Used: 24841574653962 (22.59 TB)    DFS Used%: 52.58%
Name: 10.64.53.19:50010 (analytics1040.eqiad.wmnet)  DFS Used: 24904078674773 (22.65 TB)    DFS Used%: 52.71%
Name: 10.64.36.128:50010 (analytics1028.eqiad.wmnet) DFS Used: 24996433223579 (22.73 TB)    DFS Used%: 52.90%
Name: 10.64.36.129:50010 (analytics1029.eqiad.wmnet) DFS Used: 25038245415956 (22.77 TB)    DFS Used%: 52.99%
Name: 10.64.53.18:50010 (analytics1039.eqiad.wmnet)  DFS Used: 25078144032077 (22.81 TB)    DFS Used%: 53.08%
Name: 10.64.53.17:50010 (analytics1038.eqiad.wmnet)  DFS Used: 25086314642814 (22.82 TB)    DFS Used%: 53.10%
Name: 10.64.53.14:50010 (analytics1035.eqiad.wmnet)  DFS Used: 25096872644001 (22.83 TB)    DFS Used%: 53.12%
Name: 10.64.36.131:50010 (analytics1031.eqiad.wmnet) DFS Used: 25162887940986 (22.89 TB)    DFS Used%: 53.26%
Name: 10.64.36.132:50010 (analytics1032.eqiad.wmnet) DFS Used: 25579248652515 (23.26 TB)    DFS Used%: 54.14%
Name: 10.64.36.134:50010 (analytics1034.eqiad.wmnet) DFS Used: 25952623444683 (23.60 TB)    DFS Used%: 54.93%
Name: 10.64.36.133:50010 (analytics1033.eqiad.wmnet) DFS Used: 26577437902292 (24.17 TB)    DFS Used%: 56.25%
Name: 10.64.53.11:50010 (analytics1019.eqiad.wmnet)  DFS Used: 15823398196956 (14.39 TB)    DFS Used%: 67.21%
Name: 10.64.53.12:50010 (analytics1020.eqiad.wmnet)  DFS Used: 15834595307796 (14.40 TB)    DFS Used%: 67.25%
Name: 10.64.36.115:50010 (analytics1015.eqiad.wmnet) DFS Used: 15868752703281 (14.43 TB)    DFS Used%: 67.40%
Name: 10.64.36.117:50010 (analytics1017.eqiad.wmnet) DFS Used: 15895973328317 (14.46 TB)    DFS Used%: 67.52%
Name: 10.64.36.116:50010 (analytics1016.eqiad.wmnet) DFS Used: 15898802110703 (14.46 TB)    DFS Used%: 67.53%
Name: 10.64.36.114:50010 (analytics1014.eqiad.wmnet) DFS Used: 15914265676257 (14.47 TB)    DFS Used%: 67.59%
Name: 10.64.5.13:50010 (analytics1013.eqiad.wmnet)   DFS Used: 15944320425203 (14.50 TB)    DFS Used%: 67.72%
Name: 10.64.5.11:50010 (analytics1011.eqiad.wmnet)   DFS Used: 15960093714655 (14.52 TB)    DFS Used%: 67.79%
Here, the total DFS Used% is 56.48%. Any datanode that is more than 10 percentage points (per default) away from that is considered over-/under-utilized. So here, the bottom 8 data nodes are above 66.48%, hence over-utilized, and need to have data moved onto the other nodes.
Note that the absolute DFS Used (no % at the end) varies greatly. It is already low for the bottom 8 data nodes, although they are over-utilized per the previous paragraph and need to move data to other nodes. The reason is that the total disk space reserved for HDFS is lower on those nodes.
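If the 10% default doesn't fit, the threshold can be passed to the balancer explicitly (a sketch; -threshold is the upstream balancer option, and the value is in percentage points of DFS Used%):
sudo -u hdfs hdfs balancer -threshold 5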
Re-balancing HDFS
To rebalance:
- Check that no other balancer is running. (Ask Ottomata, he would know. It's recommended to run at most 1 balancer per cluster.)
- Decide which node to run the balancer on. (Several pages on the web recommend running the balancer on a node that does not host Hadoop services. Hence, as of 2018-10-11, an-coord1001 seems like a sensible choice.)
- Adjust the balancing bandwidth. (The default balancing bandwidth of 1 MiB/s is too little to keep up with the rate at which the cluster gets unbalanced. When running in March 2015, 40 MiB/s was sufficient to decrease unbalancedness by about 7 TB/day without grinding the cluster to a halt. Higher values might or might not speed things up.)
sudo -u hdfs hdfs dfsadmin -setBalancerBandwidth $((40*1048576))
- Start the balancing run:
sudo -u hdfs hdfs balancer
(Each block move that the balancer did is permanent. So when aborting a balancer run at some point, the balancing work done up to then stays effective.)
HDFS Balancer Cron Job
an-coord1001 has a cron job installed that should run the HDFS Balancer daily. However, we have noticed that often the balancer will run for much more than 24 hours, because it is never satisfied with the distribution of blocks around the cluster. Also, it seems that a long running balancer starts to slow down, and not balance well. If this happens, you should probably restart the balancer using the same cron job command.
First, check if a balancer is running that you want to restart:
ps aux | grep '\-Dproc_balancer' | grep -v grep
# hdfs 12327 0.7 3.2 1696740 265292 pts/5 Sl+ 13:56 0:10 /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -Dproc_balancer -Xmx1000m -Dhadoop.log.dir=/usr/lib/hadoop/logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/usr/lib/hadoop -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/usr/lib/hadoop/lib/native -Dhadoop.policy.file=hadoop-policy.xml -Djava.net.preferIPv4Stack=true -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.hdfs.server.balancer.Balancer
Login as hdfs user, kill the balancer, delete the balancer lock file, and then run the balancer cron job command:
sudo -u hdfs -i
kill 12327
rm /tmp/hdfs-balancer.lock
crontab -l | grep balancer
# copy paste the balancer full command:
(lockfile-check /tmp/hdfs-balancer && echo "$(date '+%y/%m/%d %H:%M:%S') WARN Not starting hdfs balancer, it is already running (or the lockfile exists)." >> /var/log/hadoop-hdfs/balancer.log) || (lockfile-create /tmp/hdfs-balancer && hdfs dfsadmin -setBalancerBandwidth $((40*1048576)) && /usr/bin/hdfs balancer 2>&1 >> /var/log/hadoop-hdfs/balancer.log; lockfile-remove /tmp/hdfs-balancer)
Fixing HDFS mount at /mnt/hdfs
/mnt/hdfs is mounted on some hosts as a way of accessing files in hdfs via the filesystem. There are several production monitoring and rsync jobs that rely on this mount, including rsync jobs that copy pagecount* data to http://dumps.wikimedia.org. If this data is not copied, the community notices.
/mnt/hdfs is mounted using fuse_hdfs, and is not very reliable. If you get hung filesystem commands on /mnt/hdfs, the following worked once to refresh it:
sudo umount -f /mnt/hdfs
sudo fusermount -uz /mnt/hdfs
sudo mount /mnt/hdfs
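A quick way to probe whether the mount is hung (before or after remounting) without getting your shell stuck is a bounded ls, a sketch using GNU coreutils timeout:
timeout 10 ls /mnt/hdfs > /dev/null && echo "mount looks healthy" || echo "mount looks hung"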
Swapping broken disk
It might happen that a disk breaks and you need to ask for a swap. The usual procedure is to cut a Phab task to DC Ops and wait for a replacement. When the disk is put in place, you might need to work with the megacli tool to configure the hardware RAID controller before adding the partitions to the disk. This section contains an example from https://phabricator.wikimedia.org/T134056, which was opened to track a disk replacement for analytics1047.
Procedure:
- Make sure that you familiarize yourself with the ideal disk configuration, that is outlined in this page for the various node types. In this case, analytics1047 is a Hadoop worker node.
- Check the status of the disks after the swap using the following commands. In this case, analytics1047's RAID is configured with 12 Virtual Drives, each one running a RAID0 with a single disk (so not simply JBOD). This is the pre-requisite to let the RAID controller expose the disks to the OS via /dev, otherwise the magic will not happen.
# Check general info about disks attached to the RAID and their status.
sudo megacli -PDList -aAll
# Get Firmware status only to have a quick peek at disk status:
sudo megacli -PDList -aAll | grep Firm
# Check Virtual Drive info
sudo megacli -LDInfo -LAll -aAll
- Then if you see "Firmware state: Unconfigured(bad)" fix it with:
# From the previous commands you should be able to fill in the variables
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ

# Example:
# elukey@analytics1047:~$ sudo megacli -PDList -aAll | egrep "Adapter|Enclosure Device ID:|Slot Number:|Firmware state"
# Adapter #0
# Enclosure Device ID: 32
# Slot Number: 0
# Firmware state: Online, Spun Up
# [..content cut..]
megacli -PDMakeGood -PhysDrv[32:0] -a0
- Check for Foreign disks and fix them if needed:
server:~# megacli -CfgForeign -Scan -a0
There are 1 foreign configuration(s) on controller 0.
server:~# megacli -CfgForeign -Clear -a0
Foreign configuration 0 is cleared on controller 0.
- Add the single disk RAID0 array:
sudo megacli -CfgLdAdd -r0 [32:0] -a0
- You should be able to see the disk now using parted or fdisk; now is the time to add the partition and filesystem to it to complete the work. In this case the disk appeared under /dev/sdd:
sudo parted /dev/sdd --script mklabel gpt
sudo parted /dev/sdd --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 /dev/sdd1
sudo tune2fs -m 0 /dev/sdd1
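After the filesystem is created, the new partition still needs an fstab entry, a mount point and a mount, as in the worker (re)image sections above. A minimal sketch, assuming the replaced disk ended up as /dev/sdd and belongs at /var/lib/hadoop/data/d:
# Look up the new filesystem's UUID (disk and mount point are the ones assumed above).
uuid=$(sudo blkid -s UUID -o value /dev/sdd1)
# Add the fstab entry unless it is already there.
grep -q "/var/lib/hadoop/data/d" /etc/fstab || \
  echo -e "# Hadoop DataNode partition d\nUUID=${uuid}\t/var/lib/hadoop/data/d\text4\tdefaults,noatime\t0\t2" | sudo tee -a /etc/fstab
# Create the mount point and mount it.
sudo mkdir -p /var/lib/hadoop/data/d
sudo mount -v /var/lib/hadoop/data/d
# Re-enable puppet / restart the DataNode so the hdfs and yarn subdirectories get recreated with the right ownership.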
Very useful info contained in: http://hwraid.le-vert.net/wiki/LSIMegaRAIDSAS
This guide is not complete of course, so please add details if you have them!
Updating Cloudera Packages
We mirror CDH packages from http://archive.cloudera.com/cdh5/ into apt.wikimedia.org's thirdparty/cloudera component. The Jessie packages are mirrored via update rules and then pulled into Stretch (note: there are no Stretch packages for CDH, we use the Jessie ones). To bring in new packages, edit the updates configuration file in Puppet and update the Suite versions for both cloudera-jessie rules to the version of CDH you want to mirror packages for. Merge this and run puppet on apt.wikimedia.org. Then, run reprepro update commands to mirror the new packages:
# These commands should be run as root on the apt.wikimedia.org server (currently install1002.wikimedia.org).

# Update CDH packages in jessie-wikimedia
# show what should be updated
reprepro --noskipold --component thirdparty/cloudera checkupdate jessie-wikimedia
# ...
# actually do the update!
reprepro --noskipold --component thirdparty/cloudera update jessie-wikimedia
# ...

# Now update stretch
# show what should be updated
reprepro --noskipold --component thirdparty/cloudera checkpull stretch-wikimedia
# ...
# actually do the update!
reprepro --noskipold --component thirdparty/cloudera pull stretch-wikimedia
# ...