User:QChris/TestClusterSetup

This page describes how to set up your own test cluster with Hadoop, Hive, Oozie, and Pig.

After following the step-by-step instructions below, you will end up with a cluster consisting of three nodes:

  • foo-master.eqiad.wmflabs. The cluster's master and main namenode.
  • foo-worker1.eqiad.wmflabs. A worker node. You can use it to schedule Oozie jobs or run Hive queries.
  • foo-worker2.eqiad.wmflabs. A worker node. You can use it to schedule Oozie jobs or run Hive queries.

Please replace foo with your login name when following these instructions!

Relevant Bugs

  • bug 68161, because the Puppet code as of 2014-07-17 does not allow bringing a cluster up.

Creating instances

  1. Create a master instance.
    • Go to the "Add instance" page.
    • Set Instance name to foo-master.
    • Set Instance type to m1.small [...].
    (A bigger instance won't buy you anything. When testing, you'll most likely not run into disk space problems, but rather into "number of files" limitations on the namenode. So let's be nice to wikitech.)
    • Click “Submit”.
  2. Create a first worker instance.
    • (Use the same settings as for the master instance, but set Instance name to foo-worker1).
  3. Create a second worker instance.
    • (Use the same settings as for the master instance, but set Instance name to foo-worker2).
  4. Wait until the “Puppet status” of all instances says “ok”.
    • (This will take ~15 minutes. The page does not refresh on its own, so reload it from time to time.)

Setting up Hadoop

  1. Set up Hadoop master
    • Go to the “configure” page for the foo-master instance.
    • Select role::analytics::hadoop::master.
    • Set hadoop_cluster_name to analytics-hadoop.
    • Set hadoop_namenodes to foo-master.eqiad.wmflabs.
    • Click “Submit”.
    • Log in to the foo-master instance.
    • Run sudo puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff.
      • (This will take ~7 minutes.)
  2. Set up first worker instance.
    • Go to the “configure” page for the foo-worker1 instance.
    • Select role::analytics::hadoop::worker.
    • Set hadoop_cluster_name to analytics-hadoop.
    • Set hadoop_namenodes to foo-master.eqiad.wmflabs.
      • (This is no typo, foo-master.eqiad.wmflabs is the namenode.)
    • Click “Submit”.
    • Log in to the foo-worker1 instance.
    • Run sudo puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff.
  3. Set up second worker instance.
    • Repeat the steps from “Set up first worker instance”, but use foo-worker2 instead of foo-worker1.
  4. Check that the two workers are known to the namenode
    • Log in to the foo-master instance.
    • Run sudo su - -c 'sudo -u hdfs hdfs dfsadmin -printTopology'
    • If everything is working, the output will show the “default-rack” with a separate line for each worker node, like:
 Rack: /default-rack
    10.68.17.180:50010 (foo-worker1.eqiad.wmflabs)
    10.68.17.181:50010 (foo-worker2.eqiad.wmflabs)
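
To double-check that HDFS actually accepts data (and not just that the workers registered), a quick smoke test along the following lines can help. This is a minimal sketch: the file name /tmp/smoke-test.txt is only an example, and the commands follow the same sudo su - pattern as the topology check above.

 # Run on foo-master. Writes a tiny file into HDFS, reads it back, and cleans up again.
 echo "hello hadoop" > /tmp/smoke-test.txt
 sudo su - -c "sudo -u hdfs hdfs dfs -mkdir -p /tmp"
 sudo su - -c "sudo -u hdfs hdfs dfs -put /tmp/smoke-test.txt /tmp/smoke-test.txt"
 sudo su - -c "sudo -u hdfs hdfs dfs -cat /tmp/smoke-test.txt"   # should print "hello hadoop"
 sudo su - -c "sudo -u hdfs hdfs dfs -rm /tmp/smoke-test.txt"    # clean up again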

Setting up Oozie, Hive

  1. Set up Hadoop master
    • Go to the “configure” page for the foo-master instance.
    • Select role::analytics::hive::server.
    • Select role::analytics::oozie::server.
    • Click “Submit”.
    • Log in to the foo-master instance.
    • Run sudo puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff.
  2. Set up first worker instance.
    • Go to the “configure” page for the foo-worker1 instance.
    • Select role::analytics::hive::client.
    • Select role::analytics::oozie::client.
    • Select role::analytics::pig.
    • Click “Submit”.
    • Log in to the foo-worker1 instance.
    • Run sudo puppet agent --onetime --verbose --no-daemonize --no-splay --show_diff.
  3. Set up second worker instance.
    • Repeat the steps from “Set up first worker instance”, but use foo-worker2 instead of foo-worker1.
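
Before moving on, it can be worth checking that the clients on the workers actually reach the services on the master. The following is a sketch to run on foo-worker1; the Oozie URL assumes Oozie's default port 11000 (if the client role already sets OOZIE_URL, the -oozie flag can be dropped).

 # Hive: listing databases exercises the Hive client and the metastore on the master.
 hive -e 'SHOW DATABASES;'

 # Oozie: asking the server for its status exercises the client/server connection.
 # (Port 11000 is an assumption based on Oozie's default.)
 oozie admin -oozie http://foo-master.eqiad.wmflabs:11000/oozie -status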

Bootstrapping content

At this point the setup itself is done, and HDFS, Hive, Oozie, and Pig are functional.

Sample data

You can use the scripts from the cluster-scripts repository to bootstrap some parts. For example:

  1. Log in to the foo-worker1 instance.
  2. In the cluster-scripts directory, run ./prepare_for_user.sh
  3. Log out.
  4. Log in to the foo-worker1 instance again.
  5. In the cluster-scripts directory, run ./hive_add_table_webrequest_sequence_stats.sh
  6. In the cluster-scripts directory, run ./hive_add_table_webrequest.sh
  7. In the cluster-scripts directory, run ./insert_webrequest_fixture_duplicate_monitoring.sh
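
To check that the scripts did what they should, you can ask Hive for the tables they created. This is only a sketch: the table names below are guesses derived from the script names above, and the tables may live in a non-default database.

 # Run on foo-worker1. Table names are assumptions based on the script names.
 hive -e 'SHOW TABLES;'
 hive -e 'DESCRIBE webrequest;'
 hive -e 'DESCRIBE webrequest_sequence_stats;'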

Sample Deployment

  1. In the refinery directory, run bin/deploy --no-dry-run --verbose
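
If there is no refinery checkout on the instance yet, something along these lines should get you to the point where bin/deploy can be run. The clone source is an assumption (the analytics/refinery repository on Gerrit); only the bin/deploy invocation itself is taken from the step above.

 # Assumption: analytics/refinery is cloned from Gerrit; adjust the URL/path to your setup.
 git clone https://gerrit.wikimedia.org/r/analytics/refinery
 cd refinery
 bin/deploy --no-dry-run --verbose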

Sample Oozie job

  1. In the refinery directory, run oozie job -config oozie/webrequest/compute_sequence_stats/job.properties -run -D hadoopMaster=foo-master.eqiad.wmflabs
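
After submitting, you can watch the job from the same worker with the standard Oozie CLI. This is a sketch: <job-id> stands for whatever ID the -run command above printed, and the -oozie URL again assumes the default port 11000.

 # List recently submitted coordinator and workflow jobs.
 oozie jobs -oozie http://foo-master.eqiad.wmflabs:11000/oozie -jobtype coordinator
 oozie jobs -oozie http://foo-master.eqiad.wmflabs:11000/oozie

 # Show details/status of one particular job; <job-id> is the ID printed by the -run above.
 oozie job -oozie http://foo-master.eqiad.wmflabs:11000/oozie -info <job-id>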

Cluster Tuning

  • Replication issues. (hdfs-site.xml) The default block size is rather big for labs, and it'll pretty quickly give you random errors about not being able to replicate blocks. You can work around that by setting dfs.namenode.fs-limits.min-block-size to 100 and dfs.blocksize to 128k (see the sketch after this list). That should basically kill the replication issue. If you run into it nonetheless, run hdfs dfs -expunge whenever the issue comes up.
  • Oozie needing 5 minutes between job creations. (oozie-site.xml) Add the property oozie.service.CoordMaterializeTriggerService.lookup.interval with value 2 to get the default down to 2 seconds (see the sketch after this list). Restart the Oozie server afterwards.
  • If YARN applications (e.g. Hive jobs, Oozie jobs, ...) do not finish in time (like >10 minutes for simple selects), check yarn application -list. If the jobs do not change there either, check mapred job -list. If UsedMem < NeededMem, or RsvdMem > 0 for more than 5 minutes, you'll likely need more data nodes. Just add foo-worker3, foo-worker4, ... according to the description for foo-worker1 above.
  • If the namenode switches into safemode, it's typically because free disk space (on the plain filesystem) is running low on the namenode under /var. Look for big directories in /var/log. After some cleanup there, you can get the namenode out of safe mode by running sudo su - -c "sudo -u hdfs hdfs dfsadmin -safemode leave"
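
For reference, here is how the property changes from the first two bullets above would look in the config files. This is a sketch: the file locations and service names assume a standard CDH package layout, and keep in mind that a later puppet run may revert manual edits.

 # Sketch only; file locations are assumptions (standard CDH layout).
 # Add the following properties inside the <configuration> element of
 # /etc/hadoop/conf/hdfs-site.xml on the namenode and the datanodes:
 #
 #   <property>
 #     <name>dfs.namenode.fs-limits.min-block-size</name>
 #     <value>100</value>
 #   </property>
 #   <property>
 #     <name>dfs.blocksize</name>
 #     <value>131072</value>  <!-- 128k -->
 #   </property>
 #
 # ... and this one inside /etc/oozie/conf/oozie-site.xml on the Oozie server:
 #
 #   <property>
 #     <name>oozie.service.CoordMaterializeTriggerService.lookup.interval</name>
 #     <value>2</value>
 #   </property>
 #
 # Afterwards restart the affected services (service names assume the CDH packages):
 sudo service hadoop-hdfs-namenode restart    # on foo-master
 sudo service hadoop-hdfs-datanode restart    # on each worker
 sudo service oozie restart                   # on foo-master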