HADOOP-6780. Move Hadoop cloud scripts to Whirr.

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@951597 13f79535-47bb-0310-9956-ffa450edef68
Thomas White 2010-06-04 22:24:53 +00:00
parent ea5200d922
commit ff1fe0803a
32 changed files with 0 additions and 5042 deletions


@@ -1,497 +0,0 @@
Hadoop Cloud Scripts
====================
These scripts allow you to run Hadoop on cloud providers. These instructions
assume you are running on Amazon EC2; the differences for other providers are
noted at the end of this document.
Getting Started
===============
First, unpack the scripts on your system. For convenience, you may want to add
the top-level directory to your PATH.
You'll also need Python (version 2.5 or newer) and the boto and simplejson
libraries. After you download boto and simplejson, you can install each in turn
by running the following in the directory where you unpacked the distribution:
% sudo python setup.py install
Alternatively, you may prefer to install the python-boto and python-simplejson
RPM or Debian packages.
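For example, on such systems the installation is typically a single command
(package names may vary slightly between distributions):
% sudo yum install python-boto python-simplejson       # RPM-based systems
% sudo apt-get install python-boto python-simplejson   # Debian-based systems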
You need to tell the scripts your AWS credentials. The simplest way to do this
is to set the environment variables (but see
http://code.google.com/p/boto/wiki/BotoConfig for other options):
* AWS_ACCESS_KEY_ID - Your AWS Access Key ID
* AWS_SECRET_ACCESS_KEY - Your AWS Secret Access Key
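For example, in a Bash-compatible shell you can set them for the current
session like this (substitute your own values):
% export AWS_ACCESS_KEY_ID=<Your AWS Access Key ID>
% export AWS_SECRET_ACCESS_KEY=<Your AWS Secret Access Key>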
To configure the scripts, create a directory called .hadoop-cloud (note the
leading ".") in your home directory. In it, create a file called
clusters.cfg with a section for each cluster you want to control. For example:
[my-hadoop-cluster]
image_id=ami-6159bf08
instance_type=c1.medium
key_name=tom
availability_zone=us-east-1c
private_key=PATH_TO_PRIVATE_KEY
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
The image chosen here is one with an i386 Fedora OS. For a list of suitable AMIs
see http://wiki.apache.org/hadoop/AmazonEC2.
The architecture must be compatible with the instance type. For m1.small and
c1.medium instances use the i386 AMIs, while for m1.large, m1.xlarge, and
c1.xlarge instances use the x86_64 AMIs. One of the high CPU instances
(c1.medium or c1.xlarge) is recommended.
Then you can run the hadoop-ec2 script. It will display usage instructions when
invoked without arguments.
You can test that it can connect to AWS by typing:
% hadoop-ec2 list
LAUNCHING A CLUSTER
===================
To launch a cluster called "my-hadoop-cluster" with 10 worker (slave) nodes
type:
% hadoop-ec2 launch-cluster my-hadoop-cluster 10
This will boot the master node and 10 worker nodes. The master node runs the
namenode, secondary namenode, and jobtracker, and each worker node runs a
datanode and a tasktracker. Equivalently the cluster could be launched as:
% hadoop-ec2 launch-cluster my-hadoop-cluster 1 nn,snn,jt 10 dn,tt
Note that using this notation you can also launch a split namenode/jobtracker
cluster:
% hadoop-ec2 launch-cluster my-hadoop-cluster 1 nn,snn 1 jt 10 dn,tt
When the nodes have started and the Hadoop cluster has come up, the console will
display a message like
Browse the cluster at http://ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com/
You can access Hadoop's web UI by visiting this URL. By default, port 80 is
opened for access from your client machine. You may change the firewall settings
(to allow access from a network, rather than just a single machine, for example)
by using the Amazon EC2 command line tools, or by using a tool like Elastic Fox.
There is a security group for each node's role. The one for the namenode
is <cluster-name>-nn, for example.
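For example, using the Amazon EC2 command line tools you could open the web UI
to a whole network with a command like the following (the security group name,
port, and CIDR here are only illustrative):
% ec2-authorize my-hadoop-cluster-nn -p 80 -s 192.0.2.0/24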
For security reasons, traffic from the network your client is running on is
proxied through the master node of the cluster using an SSH tunnel (a SOCKS
proxy on port 6666). To set up the proxy run the following command:
% hadoop-ec2 proxy my-hadoop-cluster
Web browsers need to be configured to use this proxy too, so you can view pages
served by worker nodes in the cluster. The most convenient way to do this is to
use a proxy auto-config (PAC) file, such as this one:
http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac
If you are using Firefox, then you may find
FoxyProxy useful for managing PAC files. (If you use FoxyProxy, then you need to
get it to use the proxy for DNS lookups. To do this, go to Tools -> FoxyProxy ->
Options, and then under "Miscellaneous" in the bottom left, choose "Use SOCKS
proxy for DNS lookups".)
PERSISTENT CLUSTERS
===================
Hadoop clusters running on EC2 that use local EC2 storage (the default) will not
retain data once the cluster has been terminated. It is possible to use EBS for
persistent data, which allows a cluster to be shut down while it is not being
used.
Note: EBS support is a Beta feature.
First create a new section called "my-ebs-cluster" in the
.hadoop-cloud/clusters.cfg file.
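A minimal section might simply mirror the EC2 settings shown earlier (the
values below are illustrative; in particular, availability_zone should match
the zone in which you will create the EBS volumes):
[my-ebs-cluster]
image_id=ami-6159bf08
instance_type=c1.medium
key_name=tom
availability_zone=us-east-1c
private_key=PATH_TO_PRIVATE_KEY
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no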
Now we need to create storage for the new cluster. Create a temporary EBS volume
of size 100GiB, format it, and save it as a snapshot in S3. This way, we only
have to do the formatting once.
% hadoop-ec2 create-formatted-snapshot my-ebs-cluster 100
We create storage for a single namenode and for two datanodes. The volumes to
create are described in a JSON spec file, which references the snapshot we just
created. Here are the contents of a JSON file called
my-ebs-cluster-storage-spec.json:
{
  "nn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/ebs1",
      "size_gb": "100",
      "snapshot_id": "snap-268e704f"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/ebs2",
      "size_gb": "100",
      "snapshot_id": "snap-268e704f"
    }
  ],
  "dn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/ebs1",
      "size_gb": "100",
      "snapshot_id": "snap-268e704f"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/ebs2",
      "size_gb": "100",
      "snapshot_id": "snap-268e704f"
    }
  ]
}
Each role (here "nn" and "dn") is the key to an array of volume
specifications. In this example, the "slave" role has two devices ("/dev/sdj"
and "/dev/sdk") with different mount points, sizes, and generated from an EBS
snapshot. The snapshot is the formatted snapshot created earlier, so that the
volumes we create are pre-formatted. The size of the drives must match the size
of the snapshot created earlier.
Let's create actual volumes using this file.
% hadoop-ec2 create-storage my-ebs-cluster nn 1 \
my-ebs-cluster-storage-spec.json
% hadoop-ec2 create-storage my-ebs-cluster dn 2 \
my-ebs-cluster-storage-spec.json
Now let's start the cluster with 2 slave nodes:
% hadoop-ec2 launch-cluster my-ebs-cluster 2
Log in and run a job that creates some output.
% hadoop-ec2 login my-ebs-cluster
# hadoop fs -mkdir input
# hadoop fs -put /etc/hadoop/conf/*.xml input
# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output \
'dfs[a-z.]+'
Look at the output:
# hadoop fs -cat output/part-00000 | head
Now let's shut down the cluster.
% hadoop-ec2 terminate-cluster my-ebs-cluster
A little while later we can restart the cluster and log in.
% hadoop-ec2 launch-cluster my-ebs-cluster 2
% hadoop-ec2 login my-ebs-cluster
The output from the job we ran before should still be there:
# hadoop fs -cat output/part-00000 | head
RUNNING JOBS
============
When you launched the cluster, a hadoop-site.xml file was created in the
directory ~/.hadoop-cloud/<cluster-name>. You can use this to connect to the
cluster by setting the HADOOP_CONF_DIR environment variable (it is also
possible to specify the configuration file directly by passing it with the
-conf option to Hadoop tools):
% export HADOOP_CONF_DIR=~/.hadoop-cloud/my-hadoop-cluster
Let's try browsing HDFS:
% hadoop fs -ls /
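The same listing can be produced without setting HADOOP_CONF_DIR by passing
the -conf generic option instead (a sketch, using the hadoop-site.xml
generated above):
% hadoop fs -conf ~/.hadoop-cloud/my-hadoop-cluster/hadoop-site.xml -ls /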
Running a job is straightforward:
% hadoop fs -mkdir input # create an input directory
% hadoop fs -put $HADOOP_HOME/LICENSE.txt input # copy a file there
% hadoop jar $HADOOP_HOME/hadoop-*-examples.jar wordcount input output
% hadoop fs -cat output/part-00000 | head
Of course, these examples assume that you have installed Hadoop on your local
machine. It is also possible to launch jobs from within the cluster. First log
into the namenode:
% hadoop-ec2 login my-hadoop-cluster
Then run a job as before:
# hadoop fs -mkdir input
# hadoop fs -put /etc/hadoop/conf/*.xml input
# hadoop jar /usr/lib/hadoop/hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
# hadoop fs -cat output/part-00000 | head
TERMINATING A CLUSTER
=====================
When you've finished with your cluster you can stop it with the following
command.
NOTE: ALL DATA WILL BE LOST UNLESS YOU ARE USING EBS!
% hadoop-ec2 terminate-cluster my-hadoop-cluster
You can then delete the EC2 security groups with:
% hadoop-ec2 delete-cluster my-hadoop-cluster
AUTOMATIC CLUSTER SHUTDOWN
==========================
You may use the --auto-shutdown option to automatically terminate a cluster a
given number of minutes after launch. This is useful for short-lived clusters
whose jobs complete in a known amount of time.
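For example, the following launches a cluster that will terminate itself 50
minutes after launch (the timeout value is illustrative):
% hadoop-ec2 launch-cluster --auto-shutdown 50 my-hadoop-cluster 10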
If you want to cancel the automatic shutdown, then run the following commands
to cancel the pending shutdown on the master and on each of the slaves:
% hadoop-ec2 exec my-hadoop-cluster shutdown -c
% hadoop-ec2 update-slaves-file my-hadoop-cluster
% hadoop-ec2 exec my-hadoop-cluster /usr/lib/hadoop/bin/slaves.sh shutdown -c
CONFIGURATION NOTES
===================
It is possible to specify options on the command line: these take precedence
over any specified in the configuration file. For example:
% hadoop-ec2 launch-cluster --image-id ami-2359bf4a --instance-type c1.xlarge \
my-hadoop-cluster 10
This command launches a 10-node cluster using the specified image and instance
type, overriding the equivalent settings (if any) that are in the
"my-hadoop-cluster" section of the configuration file. Note that words in
options are separated by hyphens (--instance-type) while the corresponding
configuration parameter is are separated by underscores (instance_type).
The scripts install Hadoop RPMs or Debian packages (depending on the OS) at
instance boot time.
By default, Apache Hadoop 0.20.1 is installed. You can also run other versions
of Apache Hadoop. For example the following uses version 0.18.3:
% hadoop-ec2 launch-cluster --env HADOOP_VERSION=0.18.3 \
my-hadoop-cluster 10
CUSTOMIZATION
=============
You can specify a list of packages to install on every instance at boot time
using the --user-packages command-line option (or the user_packages
configuration parameter). Packages should be space-separated. Note that package
names should reflect the package manager being used to install them (yum or
apt-get depending on the OS).
Here's an example that installs RPMs for R and git:
% hadoop-ec2 launch-cluster --user-packages 'R git-core' my-hadoop-cluster 10
You have full control over the script that is run when each instance boots. The
default script, hadoop-ec2-init-remote.sh, may be used as a starting point to
add extra configuration or customization of the instance. Make a copy of the
script in your home directory, or somewhere similar, and set the
--user-data-file command-line option (or the user_data_file configuration
parameter) to point to the (modified) copy. hadoop-ec2 will replace "%ENV%"
in your user data script with USER_PACKAGES, AUTO_SHUTDOWN, and EBS_MAPPINGS,
as well as any extra parameters supplied using the --env command-line flag.
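As an illustrative sketch (the URL and the variable name here are
hypothetical), a launch using a customized script and an extra environment
variable might look like this:
% hadoop-ec2 launch-cluster --user-data-file http://example.com/my-init.sh \
  --env MY_SETTING=value my-hadoop-cluster 10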
Another way of customizing the instance, which may be more appropriate for
larger changes, is to create your own image.
It's possible to use any image, as long as it i) runs (gzip compressed) user
data on boot, and ii) has Java installed.
OTHER SERVICES
==============
ZooKeeper
=========
You can run ZooKeeper by setting the "service" parameter to "zookeeper". For
example:
[my-zookeeper-cluster]
service=zookeeper
ami=ami-ed59bf84
instance_type=m1.small
key_name=tom
availability_zone=us-east-1c
public_key=PATH_TO_PUBLIC_KEY
private_key=PATH_TO_PRIVATE_KEY
Then to launch a three-node ZooKeeper ensemble, run:
% ./hadoop-ec2 launch-cluster my-zookeeper-cluster 3 zk
PROVIDER-SPECIFIC DETAILS
=========================
Rackspace
=========
Running on Rackspace is very similar to running on EC2, with a few minor
differences noted here.
Security Warning
================
Currently, Hadoop clusters on Rackspace are insecure since they don't run behind
a firewall.
Creating an image
=================
Rackspace doesn't support shared images, so you will need to build your own base
image to get started. See "Instructions for creating an image" at the end of
this document for details.
Installation
============
To run on Rackspace you need to install libcloud by checking out the latest
source from Apache:
git clone git://git.apache.org/libcloud.git
cd libcloud; python setup.py install
Set up your Rackspace credentials by exporting the following environment
variables:
* RACKSPACE_KEY - Your Rackspace user name
* RACKSPACE_SECRET - Your Rackspace API key
Configuration
=============
The cloud_provider parameter must be set to specify Rackspace as the provider.
Here is a typical configuration:
[my-rackspace-cluster]
cloud_provider=rackspace
image_id=200152
instance_type=4
public_key=/path/to/public/key/file
private_key=/path/to/private/key/file
ssh_options=-i %(private_key)s -o StrictHostKeyChecking=no
It's a good idea to create a dedicated key using a command similar to:
ssh-keygen -f id_rsa_rackspace -P ''
Launching a cluster
===================
Use the "hadoop-cloud" command instead of "hadoop-ec2".
After launching a cluster you need to manually add a hostname mapping for the
master node to your client's /etc/hosts file, since DNS is not set up for the
cluster nodes and your client won't be able to resolve their addresses. You
can do this with:
hadoop-cloud list my-rackspace-cluster | grep 'nn,snn,jt' \
| awk '{print $4 " " $3 }' | sudo tee -a /etc/hosts
Instructions for creating an image
==================================
First set your Rackspace credentials:
export RACKSPACE_KEY=<Your Rackspace user name>
export RACKSPACE_SECRET=<Your Rackspace API key>
Now create an authentication token for the session, and retrieve the server
management URL to perform operations against.
# Final SED is to remove trailing ^M
AUTH_TOKEN=`curl -D - -H X-Auth-User:$RACKSPACE_KEY \
-H X-Auth-Key:$RACKSPACE_SECRET https://auth.api.rackspacecloud.com/v1.0 \
| grep 'X-Auth-Token:' | awk '{print $2}' | sed 's/.$//'`
SERVER_MANAGEMENT_URL=`curl -D - -H X-Auth-User:$RACKSPACE_KEY \
-H X-Auth-Key:$RACKSPACE_SECRET https://auth.api.rackspacecloud.com/v1.0 \
| grep 'X-Server-Management-Url:' | awk '{print $2}' | sed 's/.$//'`
echo $AUTH_TOKEN
echo $SERVER_MANAGEMENT_URL
You can get a list of images with the following
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/images
Here's the same query, but with pretty-printed XML output:
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/images.xml | xmllint --format -
There are similar queries for flavors and running instances:
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/flavors.xml | xmllint --format -
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/servers.xml | xmllint --format -
The following command will create a new server. In this case it will create a
2GB Ubuntu 8.10 instance, as determined by the imageId and flavorId attributes.
The name of the instance is set to something meaningful too.
curl -v -X POST -H X-Auth-Token:$AUTH_TOKEN -H 'Content-type: text/xml' -d @- $SERVER_MANAGEMENT_URL/servers << EOF
<server xmlns="http://docs.rackspacecloud.com/servers/api/v1.0" name="apache-hadoop-ubuntu-8.10-base" imageId="11" flavorId="4">
<metadata/>
</server>
EOF
Make a note of the new server's ID, public IP address and admin password as you
will need these later.
You can check the status of the server with
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/servers/$SERVER_ID.xml | xmllint --format -
When it has started (status "ACTIVE"), copy the setup script over:
scp tools/rackspace/remote-setup.sh root@$SERVER:remote-setup.sh
Log in to the server and run the setup script (you will need to manually
accept the Sun Java license):
sh remote-setup.sh
Once the script has completed, log out and create an image of the running
instance (giving it a memorable name):
curl -v -X POST -H X-Auth-Token:$AUTH_TOKEN -H 'Content-type: text/xml' -d @- $SERVER_MANAGEMENT_URL/images << EOF
<image xmlns="http://docs.rackspacecloud.com/servers/api/v1.0" name="Apache Hadoop Ubuntu 8.10" serverId="$SERVER_ID" />
EOF
Keep a note of the image ID as this is what you will use to launch fresh
instances from.
You can check the status of the image with
curl -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/images/$IMAGE_ID.xml | xmllint --format -
When it's "ACTIVE" is is ready for use. It's important to realize that you have
to keep the server from which you generated the image running for as long as the
image is in use.
However, if you want to clean up an old instance run:
curl -X DELETE -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/servers/$SERVER_ID
Similarly, you can delete old images:
curl -X DELETE -H X-Auth-Token:$AUTH_TOKEN $SERVER_MANAGEMENT_URL/images/$IMAGE_ID


@@ -1,45 +0,0 @@
<?xml version="1.0"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<project name="hadoop-cloud" default="test-py">
<property name="lib.dir" value="${basedir}/lib"/>
<path id="java.classpath">
<fileset dir="${lib.dir}">
<include name="**/*.jar" />
</fileset>
</path>
<path id="test.py.path">
<pathelement location="${basedir}/src/py"/>
<pathelement location="${basedir}/src/test/py"/>
</path>
<target name="test-py" description="Run python unit tests">
<taskdef name="py-test" classname="org.pyant.tasks.PythonTestTask">
<classpath refid="java.classpath" />
</taskdef>
<py-test python="python" pythonpathref="test.py.path" >
<fileset dir="${basedir}/src/test/py">
<include name="*.py"/>
</fileset>
</py-test>
</target>
<target name="compile"/>
<target name="package"/>
<target name="test" depends="test-py"/>
<target name="clean"/>
</project>


@@ -1,202 +0,0 @@
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.


@@ -1,52 +0,0 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script tests the "hadoop-ec2 create-formatted-snapshot" command.
# The snapshot is deleted immediately afterwards.
#
# Example usage:
# ./create-ebs-snapshot.sh
#
set -e
set -x
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
WORKSPACE=${WORKSPACE:-`pwd`}
CONFIG_DIR=${CONFIG_DIR:-$WORKSPACE/.hadoop-cloud}
CLUSTER=${CLUSTER:-hadoop-cloud-$USER-test-cluster}
AVAILABILITY_ZONE=${AVAILABILITY_ZONE:-us-east-1c}
KEY_NAME=${KEY_NAME:-$USER}
HADOOP_CLOUD_HOME=${HADOOP_CLOUD_HOME:-$bin/../py}
HADOOP_CLOUD_PROVIDER=${HADOOP_CLOUD_PROVIDER:-ec2}
SSH_OPTIONS=${SSH_OPTIONS:-"-i ~/.$HADOOP_CLOUD_PROVIDER/id_rsa-$KEY_NAME \
-o StrictHostKeyChecking=no"}
HADOOP_CLOUD_SCRIPT=$HADOOP_CLOUD_HOME/hadoop-$HADOOP_CLOUD_PROVIDER
$HADOOP_CLOUD_SCRIPT create-formatted-snapshot --config-dir=$CONFIG_DIR \
--key-name=$KEY_NAME --availability-zone=$AVAILABILITY_ZONE \
--ssh-options="$SSH_OPTIONS" \
$CLUSTER 1 > out.tmp
snapshot_id=`grep 'Created snapshot' out.tmp | awk '{print $3}'`
ec2-delete-snapshot $snapshot_id
rm -f out.tmp


@@ -1,30 +0,0 @@
{
  "nn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/ebs1",
      "size_gb": "7",
      "snapshot_id": "snap-fe44bb97"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/ebs2",
      "size_gb": "7",
      "snapshot_id": "snap-fe44bb97"
    }
  ],
  "dn": [
    {
      "device": "/dev/sdj",
      "mount_point": "/ebs1",
      "size_gb": "7",
      "snapshot_id": "snap-fe44bb97"
    },
    {
      "device": "/dev/sdk",
      "mount_point": "/ebs2",
      "size_gb": "7",
      "snapshot_id": "snap-fe44bb97"
    }
  ]
}


@@ -1,122 +0,0 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script tests the Hadoop cloud scripts by running through a minimal
# sequence of steps to start a persistent (EBS) cluster, run a job, then
# shutdown the cluster.
#
# Example usage:
# HADOOP_HOME=~/dev/hadoop-0.20.1/ ./persistent-cluster.sh
#
function wait_for_volume_detachment() {
  set +e
  set +x
  while true; do
    attached=`$HADOOP_CLOUD_SCRIPT list-storage --config-dir=$CONFIG_DIR \
      $CLUSTER | awk '{print $6}' | grep 'attached'`
    sleep 5
    if [ -z "$attached" ]; then
      break
    fi
  done
  set -e
  set -x
}
set -e
set -x
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
WORKSPACE=${WORKSPACE:-`pwd`}
CONFIG_DIR=${CONFIG_DIR:-$WORKSPACE/.hadoop-cloud}
CLUSTER=${CLUSTER:-hadoop-cloud-ebs-$USER-test-cluster}
IMAGE_ID=${IMAGE_ID:-ami-6159bf08} # default to Fedora 32-bit AMI
AVAILABILITY_ZONE=${AVAILABILITY_ZONE:-us-east-1c}
KEY_NAME=${KEY_NAME:-$USER}
AUTO_SHUTDOWN=${AUTO_SHUTDOWN:-15}
LOCAL_HADOOP_VERSION=${LOCAL_HADOOP_VERSION:-0.20.1}
HADOOP_HOME=${HADOOP_HOME:-$WORKSPACE/hadoop-$LOCAL_HADOOP_VERSION}
HADOOP_CLOUD_HOME=${HADOOP_CLOUD_HOME:-$bin/../py}
HADOOP_CLOUD_PROVIDER=${HADOOP_CLOUD_PROVIDER:-ec2}
SSH_OPTIONS=${SSH_OPTIONS:-"-i ~/.$HADOOP_CLOUD_PROVIDER/id_rsa-$KEY_NAME \
-o StrictHostKeyChecking=no"}
HADOOP_CLOUD_SCRIPT=$HADOOP_CLOUD_HOME/hadoop-$HADOOP_CLOUD_PROVIDER
export HADOOP_CONF_DIR=$CONFIG_DIR/$CLUSTER
# Install Hadoop locally
if [ ! -d $HADOOP_HOME ]; then
  wget http://archive.apache.org/dist/hadoop/core/hadoop-\
$LOCAL_HADOOP_VERSION/hadoop-$LOCAL_HADOOP_VERSION.tar.gz
  tar zxf hadoop-$LOCAL_HADOOP_VERSION.tar.gz -C $WORKSPACE
  rm hadoop-$LOCAL_HADOOP_VERSION.tar.gz
fi
# Create storage
$HADOOP_CLOUD_SCRIPT create-storage --config-dir=$CONFIG_DIR \
--availability-zone=$AVAILABILITY_ZONE $CLUSTER nn 1 \
$bin/ebs-storage-spec.json
$HADOOP_CLOUD_SCRIPT create-storage --config-dir=$CONFIG_DIR \
--availability-zone=$AVAILABILITY_ZONE $CLUSTER dn 1 \
$bin/ebs-storage-spec.json
# Launch a cluster
$HADOOP_CLOUD_SCRIPT launch-cluster --config-dir=$CONFIG_DIR \
--image-id=$IMAGE_ID --key-name=$KEY_NAME --auto-shutdown=$AUTO_SHUTDOWN \
--availability-zone=$AVAILABILITY_ZONE $CLIENT_CIDRS $ENVS $CLUSTER 1
# Run a proxy and save its pid in HADOOP_CLOUD_PROXY_PID
eval `$HADOOP_CLOUD_SCRIPT proxy --config-dir=$CONFIG_DIR \
--ssh-options="$SSH_OPTIONS" $CLUSTER`
# Run a job and check it works
$HADOOP_HOME/bin/hadoop fs -mkdir input
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/LICENSE.txt input
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep \
input output Apache
# following returns a non-zero exit code if no match
$HADOOP_HOME/bin/hadoop fs -cat 'output/part-00000' | grep Apache
# Shutdown the cluster
kill $HADOOP_CLOUD_PROXY_PID
$HADOOP_CLOUD_SCRIPT terminate-cluster --config-dir=$CONFIG_DIR --force $CLUSTER
sleep 5 # wait for termination to take effect
# Relaunch the cluster
$HADOOP_CLOUD_SCRIPT launch-cluster --config-dir=$CONFIG_DIR \
--image-id=$IMAGE_ID --key-name=$KEY_NAME --auto-shutdown=$AUTO_SHUTDOWN \
--availability-zone=$AVAILABILITY_ZONE $CLIENT_CIDRS $ENVS $CLUSTER 1
# Run a proxy and save its pid in HADOOP_CLOUD_PROXY_PID
eval `$HADOOP_CLOUD_SCRIPT proxy --config-dir=$CONFIG_DIR \
--ssh-options="$SSH_OPTIONS" $CLUSTER`
# Check output is still there
$HADOOP_HOME/bin/hadoop fs -cat 'output/part-00000' | grep Apache
# Shutdown the cluster
kill $HADOOP_CLOUD_PROXY_PID
$HADOOP_CLOUD_SCRIPT terminate-cluster --config-dir=$CONFIG_DIR --force $CLUSTER
sleep 5 # wait for termination to take effect
# Cleanup
$HADOOP_CLOUD_SCRIPT delete-cluster --config-dir=$CONFIG_DIR $CLUSTER
wait_for_volume_detachment
$HADOOP_CLOUD_SCRIPT delete-storage --config-dir=$CONFIG_DIR --force $CLUSTER


@@ -1,112 +0,0 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This script tests the Hadoop cloud scripts by running through a minimal
# sequence of steps to start a cluster, run a job, then shutdown the cluster.
#
# Example usage:
# HADOOP_HOME=~/dev/hadoop-0.20.1/ ./transient-cluster.sh
#
set -e
set -x
bin=`dirname "$0"`
bin=`cd "$bin"; pwd`
WORKSPACE=${WORKSPACE:-`pwd`}
CONFIG_DIR=${CONFIG_DIR:-$WORKSPACE/.hadoop-cloud}
CLUSTER=${CLUSTER:-hadoop-cloud-$USER-test-cluster}
IMAGE_ID=${IMAGE_ID:-ami-6159bf08} # default to Fedora 32-bit AMI
INSTANCE_TYPE=${INSTANCE_TYPE:-m1.small}
AVAILABILITY_ZONE=${AVAILABILITY_ZONE:-us-east-1c}
KEY_NAME=${KEY_NAME:-$USER}
AUTO_SHUTDOWN=${AUTO_SHUTDOWN:-15}
LOCAL_HADOOP_VERSION=${LOCAL_HADOOP_VERSION:-0.20.1}
HADOOP_HOME=${HADOOP_HOME:-$WORKSPACE/hadoop-$LOCAL_HADOOP_VERSION}
HADOOP_CLOUD_HOME=${HADOOP_CLOUD_HOME:-$bin/../py}
HADOOP_CLOUD_PROVIDER=${HADOOP_CLOUD_PROVIDER:-ec2}
PUBLIC_KEY=${PUBLIC_KEY:-~/.$HADOOP_CLOUD_PROVIDER/id_rsa-$KEY_NAME.pub}
PRIVATE_KEY=${PRIVATE_KEY:-~/.$HADOOP_CLOUD_PROVIDER/id_rsa-$KEY_NAME}
SSH_OPTIONS=${SSH_OPTIONS:-"-i $PRIVATE_KEY -o StrictHostKeyChecking=no"}
LAUNCH_ARGS=${LAUNCH_ARGS:-"1 nn,snn,jt 1 dn,tt"}
HADOOP_CLOUD_SCRIPT=$HADOOP_CLOUD_HOME/hadoop-cloud
export HADOOP_CONF_DIR=$CONFIG_DIR/$CLUSTER
# Install Hadoop locally
if [ ! -d $HADOOP_HOME ]; then
  wget http://archive.apache.org/dist/hadoop/core/hadoop-\
$LOCAL_HADOOP_VERSION/hadoop-$LOCAL_HADOOP_VERSION.tar.gz
  tar zxf hadoop-$LOCAL_HADOOP_VERSION.tar.gz -C $WORKSPACE
  rm hadoop-$LOCAL_HADOOP_VERSION.tar.gz
fi
# Launch a cluster
if [ $HADOOP_CLOUD_PROVIDER == 'ec2' ]; then
  $HADOOP_CLOUD_SCRIPT launch-cluster \
    --config-dir=$CONFIG_DIR \
    --image-id=$IMAGE_ID \
    --instance-type=$INSTANCE_TYPE \
    --key-name=$KEY_NAME \
    --auto-shutdown=$AUTO_SHUTDOWN \
    --availability-zone=$AVAILABILITY_ZONE \
    $CLIENT_CIDRS $ENVS $CLUSTER $LAUNCH_ARGS
else
  $HADOOP_CLOUD_SCRIPT launch-cluster --cloud-provider=$HADOOP_CLOUD_PROVIDER \
    --config-dir=$CONFIG_DIR \
    --image-id=$IMAGE_ID \
    --instance-type=$INSTANCE_TYPE \
    --public-key=$PUBLIC_KEY \
    --private-key=$PRIVATE_KEY \
    --auto-shutdown=$AUTO_SHUTDOWN \
    $CLIENT_CIDRS $ENVS $CLUSTER $LAUNCH_ARGS
fi
# List clusters
$HADOOP_CLOUD_SCRIPT list --cloud-provider=$HADOOP_CLOUD_PROVIDER \
--config-dir=$CONFIG_DIR
$HADOOP_CLOUD_SCRIPT list --cloud-provider=$HADOOP_CLOUD_PROVIDER \
--config-dir=$CONFIG_DIR $CLUSTER
# Run a proxy and save its pid in HADOOP_CLOUD_PROXY_PID
eval `$HADOOP_CLOUD_SCRIPT proxy --cloud-provider=$HADOOP_CLOUD_PROVIDER \
--config-dir=$CONFIG_DIR \
--ssh-options="$SSH_OPTIONS" $CLUSTER`
if [ $HADOOP_CLOUD_PROVIDER == 'rackspace' ]; then
  # Need to update /etc/hosts (interactively)
  $HADOOP_CLOUD_SCRIPT list --cloud-provider=$HADOOP_CLOUD_PROVIDER \
    --config-dir=$CONFIG_DIR $CLUSTER | grep 'nn,snn,jt' \
    | awk '{print $4 " " $3 }' | sudo tee -a /etc/hosts
fi
# Run a job and check it works
$HADOOP_HOME/bin/hadoop fs -mkdir input
$HADOOP_HOME/bin/hadoop fs -put $HADOOP_HOME/LICENSE.txt input
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-*-examples.jar grep \
input output Apache
# following returns a non-zero exit code if no match
$HADOOP_HOME/bin/hadoop fs -cat 'output/part-00000' | grep Apache
# Shutdown the cluster
kill $HADOOP_CLOUD_PROXY_PID
$HADOOP_CLOUD_SCRIPT terminate-cluster --cloud-provider=$HADOOP_CLOUD_PROVIDER \
--config-dir=$CONFIG_DIR --force $CLUSTER
sleep 5 # wait for termination to take effect
$HADOOP_CLOUD_SCRIPT delete-cluster --cloud-provider=$HADOOP_CLOUD_PROVIDER \
--config-dir=$CONFIG_DIR $CLUSTER


@@ -1,21 +0,0 @@
#!/usr/bin/env python2.5
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from hadoop.cloud.cli import main
if __name__ == "__main__":
main()


@@ -1,21 +0,0 @@
#!/usr/bin/env python2.5
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from hadoop.cloud.cli import main
if __name__ == "__main__":
main()


@@ -1,14 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@@ -1,15 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
VERSION="0.22.0"


@@ -1,438 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import with_statement
import ConfigParser
from hadoop.cloud import VERSION
from hadoop.cloud.cluster import get_cluster
from hadoop.cloud.service import get_service
from hadoop.cloud.service import InstanceTemplate
from hadoop.cloud.service import NAMENODE
from hadoop.cloud.service import SECONDARY_NAMENODE
from hadoop.cloud.service import JOBTRACKER
from hadoop.cloud.service import DATANODE
from hadoop.cloud.service import TASKTRACKER
from hadoop.cloud.util import merge_config_with_options
from hadoop.cloud.util import xstr
import logging
from optparse import OptionParser
from optparse import make_option
import os
import sys
DEFAULT_SERVICE_NAME = 'hadoop'
DEFAULT_CLOUD_PROVIDER = 'ec2'
DEFAULT_CONFIG_DIR_NAME = '.hadoop-cloud'
DEFAULT_CONFIG_DIR = os.path.join(os.environ['HOME'], DEFAULT_CONFIG_DIR_NAME)
CONFIG_FILENAME = 'clusters.cfg'
CONFIG_DIR_OPTION = \
make_option("--config-dir", metavar="CONFIG-DIR",
help="The configuration directory.")
PROVIDER_OPTION = \
make_option("--cloud-provider", metavar="PROVIDER",
help="The cloud provider, e.g. 'ec2' for Amazon EC2.")
BASIC_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
]
LAUNCH_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
make_option("-a", "--ami", metavar="AMI",
help="The AMI ID of the image to launch. (Amazon EC2 only. Deprecated, use \
--image-id.)"),
make_option("-e", "--env", metavar="ENV", action="append",
help="An environment variable to pass to instances. \
(May be specified multiple times.)"),
make_option("-f", "--user-data-file", metavar="URL",
help="The URL of the file containing user data to be made available to \
instances."),
make_option("--image-id", metavar="ID",
help="The ID of the image to launch."),
make_option("-k", "--key-name", metavar="KEY-PAIR",
help="The key pair to use when launching instances. (Amazon EC2 only.)"),
make_option("-p", "--user-packages", metavar="PACKAGES",
help="A space-separated list of packages to install on instances on start \
up."),
make_option("-t", "--instance-type", metavar="TYPE",
help="The type of instance to be launched. One of m1.small, m1.large, \
m1.xlarge, c1.medium, or c1.xlarge."),
make_option("-z", "--availability-zone", metavar="ZONE",
help="The availability zone to run the instances in."),
make_option("--auto-shutdown", metavar="TIMEOUT_MINUTES",
help="The time in minutes after launch when an instance will be \
automatically shut down."),
make_option("--client-cidr", metavar="CIDR", action="append",
help="The CIDR of the client, which is used to allow access through the \
firewall to the master node. (May be specified multiple times.)"),
make_option("--security-group", metavar="SECURITY_GROUP", action="append",
default=[], help="Additional security groups within which the instances \
should be run. (Amazon EC2 only.) (May be specified multiple times.)"),
make_option("--public-key", metavar="FILE",
help="The public key to authorize on launching instances. (Non-EC2 \
providers only.)"),
make_option("--private-key", metavar="FILE",
help="The private key to use when connecting to instances. (Non-EC2 \
providers only.)"),
]
SNAPSHOT_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
make_option("-k", "--key-name", metavar="KEY-PAIR",
help="The key pair to use when launching instances."),
make_option("-z", "--availability-zone", metavar="ZONE",
help="The availability zone to run the instances in."),
make_option("--ssh-options", metavar="SSH-OPTIONS",
help="SSH options to use."),
]
PLACEMENT_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
make_option("-z", "--availability-zone", metavar="ZONE",
help="The availability zone to run the instances in."),
]
FORCE_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
make_option("--force", action="store_true", default=False,
help="Do not ask for confirmation."),
]
SSH_OPTIONS = [
CONFIG_DIR_OPTION,
PROVIDER_OPTION,
make_option("--ssh-options", metavar="SSH-OPTIONS",
help="SSH options to use."),
]
def print_usage(script):
print """Usage: %(script)s COMMAND [OPTIONS]
where COMMAND and [OPTIONS] may be one of:
list [CLUSTER] list all running Hadoop clusters
or instances in CLUSTER
launch-master CLUSTER launch or find a master in CLUSTER
launch-slaves CLUSTER NUM_SLAVES launch NUM_SLAVES slaves in CLUSTER
launch-cluster CLUSTER (NUM_SLAVES| launch a master and NUM_SLAVES slaves or
N ROLE [N ROLE ...]) N instances in ROLE in CLUSTER
create-formatted-snapshot CLUSTER create an empty, formatted snapshot of
SIZE size SIZE GiB
list-storage CLUSTER list storage volumes for CLUSTER
create-storage CLUSTER ROLE create volumes for NUM_INSTANCES instances
NUM_INSTANCES SPEC_FILE in ROLE for CLUSTER, using SPEC_FILE
attach-storage ROLE attach storage volumes for ROLE to CLUSTER
login CLUSTER log in to the master in CLUSTER over SSH
proxy CLUSTER start a SOCKS proxy on localhost into the
CLUSTER
push CLUSTER FILE scp FILE to the master in CLUSTER
exec CLUSTER CMD execute CMD on the master in CLUSTER
terminate-cluster CLUSTER terminate all instances in CLUSTER
delete-cluster CLUSTER delete the group information for CLUSTER
delete-storage CLUSTER delete all storage volumes for CLUSTER
update-slaves-file CLUSTER update the slaves file on the CLUSTER
master
Use %(script)s COMMAND --help to see additional options for specific commands.
""" % locals()
def print_deprecation(script, replacement):
print "Deprecated. Use '%(script)s %(replacement)s'." % locals()
def parse_options_and_config(command, option_list=[], extra_arguments=(),
unbounded_args=False):
"""
Parse the arguments to command using the given option list, and combine with
any configuration parameters.
If unbounded_args is true then there must be at least as many extra arguments
as specified by extra_arguments (the first argument is always CLUSTER).
Otherwise there must be exactly the same number of arguments as
extra_arguments.
"""
expected_arguments = ["CLUSTER",]
expected_arguments.extend(extra_arguments)
(options_dict, args) = parse_options(command, option_list, expected_arguments,
unbounded_args)
config_dir = get_config_dir(options_dict)
config_files = [os.path.join(config_dir, CONFIG_FILENAME)]
if 'config_dir' not in options_dict:
# if config_dir not set, then also search in current directory
config_files.insert(0, CONFIG_FILENAME)
config = ConfigParser.ConfigParser()
read_files = config.read(config_files)
logging.debug("Read %d configuration files: %s", len(read_files),
", ".join(read_files))
cluster_name = args[0]
opt = merge_config_with_options(cluster_name, config, options_dict)
logging.debug("Options: %s", str(opt))
service_name = get_service_name(opt)
cloud_provider = get_cloud_provider(opt)
cluster = get_cluster(cloud_provider)(cluster_name, config_dir)
service = get_service(service_name, cloud_provider)(cluster)
return (opt, args, service)
def parse_options(command, option_list=[], expected_arguments=(),
unbounded_args=False):
"""
Parse the arguments to command using the given option list.
If unbounded_args is true then there must be at least as many extra arguments
as specified by extra_arguments (the first argument is always CLUSTER).
Otherwise there must be exactly the same number of arguments as
extra_arguments.
"""
config_file_name = "%s/%s" % (DEFAULT_CONFIG_DIR_NAME, CONFIG_FILENAME)
usage = """%%prog %s [options] %s
Options may also be specified in a configuration file called
%s located in the user's home directory.
Options specified on the command line take precedence over any in the
configuration file.""" % (command, " ".join(expected_arguments),
config_file_name)
parser = OptionParser(usage=usage, version="%%prog %s" % VERSION,
option_list=option_list)
parser.disable_interspersed_args()
(options, args) = parser.parse_args(sys.argv[2:])
if unbounded_args:
if len(args) < len(expected_arguments):
parser.error("incorrect number of arguments")
elif len(args) != len(expected_arguments):
parser.error("incorrect number of arguments")
return (vars(options), args)
def get_config_dir(options_dict):
config_dir = options_dict.get('config_dir')
if not config_dir:
config_dir = DEFAULT_CONFIG_DIR
return config_dir
def get_service_name(options_dict):
service_name = options_dict.get("service", None)
if service_name is None:
service_name = DEFAULT_SERVICE_NAME
return service_name
def get_cloud_provider(options_dict):
provider = options_dict.get("cloud_provider", None)
if provider is None:
provider = DEFAULT_CLOUD_PROVIDER
return provider
def check_options_set(options, option_names):
for option_name in option_names:
if options.get(option_name) is None:
print "Option '%s' is missing. Aborting." % option_name
sys.exit(1)
def check_launch_options_set(cluster, options):
if cluster.get_provider_code() == 'ec2':
if options.get('ami') is None and options.get('image_id') is None:
print "One of ami or image_id must be specified. Aborting."
sys.exit(1)
check_options_set(options, ['key_name'])
else:
check_options_set(options, ['image_id', 'public_key'])
def get_image_id(cluster, options):
if cluster.get_provider_code() == 'ec2':
return options.get('image_id', options.get('ami'))
else:
return options.get('image_id')
def main():
# Use HADOOP_CLOUD_LOGGING_LEVEL=DEBUG to enable debugging output.
logging.basicConfig(level=getattr(logging,
os.getenv("HADOOP_CLOUD_LOGGING_LEVEL",
"INFO")))
if len(sys.argv) < 2:
print_usage(sys.argv[0])
sys.exit(1)
command = sys.argv[1]
if command == 'list':
(opt, args) = parse_options(command, BASIC_OPTIONS, unbounded_args=True)
if len(args) == 0:
service_name = get_service_name(opt)
cloud_provider = get_cloud_provider(opt)
service = get_service(service_name, cloud_provider)(None)
service.list_all(cloud_provider)
else:
(opt, args, service) = parse_options_and_config(command, BASIC_OPTIONS)
service.list()
elif command == 'launch-master':
(opt, args, service) = parse_options_and_config(command, LAUNCH_OPTIONS)
check_launch_options_set(service.cluster, opt)
config_dir = get_config_dir(opt)
template = InstanceTemplate((NAMENODE, SECONDARY_NAMENODE, JOBTRACKER), 1,
get_image_id(service.cluster, opt),
opt.get('instance_type'), opt.get('key_name'),
opt.get('public_key'), opt.get('private_key'),
opt.get('user_data_file'),
opt.get('availability_zone'), opt.get('user_packages'),
opt.get('auto_shutdown'), opt.get('env'),
opt.get('security_group'))
service.launch_master(template, config_dir, opt.get('client_cidr'))
elif command == 'launch-slaves':
(opt, args, service) = parse_options_and_config(command, LAUNCH_OPTIONS,
("NUM_SLAVES",))
number_of_slaves = int(args[1])
check_launch_options_set(service.cluster, opt)
template = InstanceTemplate((DATANODE, TASKTRACKER), number_of_slaves,
get_image_id(service.cluster, opt),
opt.get('instance_type'), opt.get('key_name'),
opt.get('public_key'), opt.get('private_key'),
opt.get('user_data_file'),
opt.get('availability_zone'), opt.get('user_packages'),
opt.get('auto_shutdown'), opt.get('env'),
opt.get('security_group'))
service.launch_slaves(template)
elif command == 'launch-cluster':
(opt, args, service) = parse_options_and_config(command, LAUNCH_OPTIONS,
("NUM_SLAVES",),
unbounded_args=True)
check_launch_options_set(service.cluster, opt)
config_dir = get_config_dir(opt)
instance_templates = []
if len(args) == 2:
number_of_slaves = int(args[1])
print_deprecation(sys.argv[0], 'launch-cluster %s 1 nn,snn,jt %s dn,tt' %
(service.cluster.name, number_of_slaves))
instance_templates = [
InstanceTemplate((NAMENODE, SECONDARY_NAMENODE, JOBTRACKER), 1,
get_image_id(service.cluster, opt),
opt.get('instance_type'), opt.get('key_name'),
opt.get('public_key'), opt.get('private_key'),
opt.get('user_data_file'),
opt.get('availability_zone'), opt.get('user_packages'),
opt.get('auto_shutdown'), opt.get('env'),
opt.get('security_group')),
InstanceTemplate((DATANODE, TASKTRACKER), number_of_slaves,
get_image_id(service.cluster, opt),
opt.get('instance_type'), opt.get('key_name'),
opt.get('public_key'), opt.get('private_key'),
opt.get('user_data_file'),
opt.get('availability_zone'), opt.get('user_packages'),
opt.get('auto_shutdown'), opt.get('env'),
opt.get('security_group')),
]
elif len(args) > 2 and len(args) % 2 == 0:
print_usage(sys.argv[0])
sys.exit(1)
else:
for i in range(len(args) / 2):
number = int(args[2 * i + 1])
roles = args[2 * i + 2].split(",")
instance_templates.append(
InstanceTemplate(roles, number, get_image_id(service.cluster, opt),
opt.get('instance_type'), opt.get('key_name'),
opt.get('public_key'), opt.get('private_key'),
opt.get('user_data_file'),
opt.get('availability_zone'),
opt.get('user_packages'),
opt.get('auto_shutdown'), opt.get('env'),
opt.get('security_group')))
service.launch_cluster(instance_templates, config_dir,
opt.get('client_cidr'))
elif command == 'login':
(opt, args, service) = parse_options_and_config(command, SSH_OPTIONS)
service.login(opt.get('ssh_options'))
elif command == 'proxy':
(opt, args, service) = parse_options_and_config(command, SSH_OPTIONS)
service.proxy(opt.get('ssh_options'))
elif command == 'push':
(opt, args, service) = parse_options_and_config(command, SSH_OPTIONS,
("FILE",))
service.push(opt.get('ssh_options'), args[1])
elif command == 'exec':
(opt, args, service) = parse_options_and_config(command, SSH_OPTIONS,
("CMD",), True)
service.execute(opt.get('ssh_options'), args[1:])
elif command == 'terminate-cluster':
(opt, args, service) = parse_options_and_config(command, FORCE_OPTIONS)
service.terminate_cluster(opt["force"])
elif command == 'delete-cluster':
(opt, args, service) = parse_options_and_config(command, BASIC_OPTIONS)
service.delete_cluster()
elif command == 'create-formatted-snapshot':
(opt, args, service) = parse_options_and_config(command, SNAPSHOT_OPTIONS,
("SIZE",))
size = int(args[1])
check_options_set(opt, ['availability_zone', 'key_name'])
ami_ubuntu_intrepid_x86 = 'ami-ec48af85' # any general-purpose AMI will do; it is only used to format the volume
service.create_formatted_snapshot(size,
opt.get('availability_zone'),
ami_ubuntu_intrepid_x86,
opt.get('key_name'),
xstr(opt.get('ssh_options')))
elif command == 'list-storage':
(opt, args, service) = parse_options_and_config(command, BASIC_OPTIONS)
service.list_storage()
elif command == 'create-storage':
(opt, args, service) = parse_options_and_config(command, PLACEMENT_OPTIONS,
("ROLE", "NUM_INSTANCES",
"SPEC_FILE"))
role = args[1]
number_of_instances = int(args[2])
spec_file = args[3]
check_options_set(opt, ['availability_zone'])
service.create_storage(role, number_of_instances,
opt.get('availability_zone'), spec_file)
elif command == 'attach-storage':
(opt, args, service) = parse_options_and_config(command, BASIC_OPTIONS,
("ROLE",))
service.attach_storage(args[1])
elif command == 'delete-storage':
(opt, args, service) = parse_options_and_config(command, FORCE_OPTIONS)
service.delete_storage(opt["force"])
elif command == 'update-slaves-file':
(opt, args, service) = parse_options_and_config(command, SSH_OPTIONS)
check_options_set(opt, ['private_key'])
ssh_options = xstr(opt.get('ssh_options'))
config_dir = get_config_dir(opt)
service.update_slaves_file(config_dir, ssh_options, opt.get('private_key'))
else:
print "Unrecognized command '%s'" % command
print_usage(sys.argv[0])
sys.exit(1)


@ -1,187 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Classes for controlling a cluster of cloud instances.
"""
from __future__ import with_statement
import gzip
import StringIO
import urllib
from hadoop.cloud.storage import Storage
CLUSTER_PROVIDER_MAP = {
"dummy": ('hadoop.cloud.providers.dummy', 'DummyCluster'),
"ec2": ('hadoop.cloud.providers.ec2', 'Ec2Cluster'),
"rackspace": ('hadoop.cloud.providers.rackspace', 'RackspaceCluster'),
}
def get_cluster(provider):
"""
Retrieve the Cluster class for a provider.
"""
mod_name, driver_name = CLUSTER_PROVIDER_MAP[provider]
_mod = __import__(mod_name, globals(), locals(), [driver_name])
return getattr(_mod, driver_name)
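# Illustrative usage (not part of the original module): the provider name from
# the cluster configuration selects a (module, class) pair from
# CLUSTER_PROVIDER_MAP and imports it lazily, e.g.
#
#   cluster_class = get_cluster("ec2")  # -> hadoop.cloud.providers.ec2.Ec2Cluster
#   cluster = cluster_class("my-cluster", config_dir)
#
# The "dummy" provider only logs the calls it receives, so it can be used to
# exercise the scripts without making any cloud API calls.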
class Cluster(object):
"""
A cluster of server instances. A cluster has a unique name.
Instances are launched into the cluster in one or more named roles.
"""
def __init__(self, name, config_dir):
self.name = name
self.config_dir = config_dir
def get_provider_code(self):
"""
The code that uniquely identifies the cloud provider.
"""
raise Exception("Unimplemented")
def authorize_role(self, role, from_port, to_port, cidr_ip):
"""
Authorize access to machines in a given role from a given network.
"""
pass
def get_instances_in_role(self, role, state_filter=None):
"""
Get all the instances in a role, filtered by state.
@param role: the name of the role
@param state_filter: the state that the instance should be in
(e.g. "running"), or None for all states
"""
raise Exception("Unimplemented")
def print_status(self, roles=None, state_filter="running"):
"""
Print the status of instances in the given roles, filtered by state.
"""
pass
def check_running(self, role, number):
"""
Check that a certain number of instances in a role are running.
"""
instances = self.get_instances_in_role(role, "running")
if len(instances) != number:
print "Expected %s instances in role %s, but was %s %s" % \
(number, role, len(instances), instances)
return False
else:
return instances
def launch_instances(self, roles, number, image_id, size_id,
instance_user_data, **kwargs):
"""
Launch instances (having the given roles) in the cluster.
Returns a list of IDs for the instances started.
"""
pass
def wait_for_instances(self, instance_ids, timeout=600):
"""
Wait for instances to start.
Raise TimeoutException if the timeout is exceeded.
"""
pass
def terminate(self):
"""
Terminate all instances in the cluster.
"""
pass
def delete(self):
"""
Delete the cluster permanently. This operation is only permitted if no
instances are running.
"""
pass
def get_storage(self):
"""
Return the external storage for the cluster.
"""
return Storage(self)
class InstanceUserData(object):
"""
The data passed to an instance on start up.
"""
def __init__(self, filename, replacements={}):
self.filename = filename
self.replacements = replacements
def _read_file(self, filename):
"""
Read the user data.
"""
return urllib.urlopen(filename).read()
def read(self):
"""
Read the user data, making replacements.
"""
contents = self._read_file(self.filename)
for (match, replacement) in self.replacements.iteritems():
if replacement == None:
replacement = ''
contents = contents.replace(match, replacement)
return contents
def read_as_gzip_stream(self):
"""
Read and compress the data.
"""
output = StringIO.StringIO()
compressed = gzip.GzipFile(mode='wb', fileobj=output)
compressed.write(self.read())
compressed.close()
return output.getvalue()
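# Illustrative usage (not part of the original module): the boot scripts below
# contain placeholders such as %ENV% that are replaced before the data is sent
# to the instance, and the result is gzipped to keep the payload small (the EC2
# boot script notes it should stay under 16K after gzip compression). For
# example, with an assumed replacement value:
#
#   user_data = InstanceUserData(user_data_file,
#                                {"%ENV%": "ROLES=nn,snn,jt AUTO_SHUTDOWN="})
#   payload = user_data.read_as_gzip_stream()
#
# The real environment string is assembled by the calling service code.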
class Instance(object):
"""
A server instance.
"""
def __init__(self, id, public_ip, private_ip):
self.id = id
self.public_ip = public_ip
self.private_ip = private_ip
class RoleSyntaxException(Exception):
"""
Raised when a role name is invalid. Role names may consist of a sequence
of alphanumeric characters and underscores. Dashes are not permitted in role
names.
"""
def __init__(self, message):
super(RoleSyntaxException, self).__init__()
self.message = message
def __str__(self):
return repr(self.message)
class TimeoutException(Exception):
"""
Raised when a timeout is exceeded.
"""
pass


@ -1,459 +0,0 @@
#!/bin/bash -x
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
# Script that is run on each instance on boot.
################################################################################
################################################################################
# Initialize variables
################################################################################
SELF_HOST=`/sbin/ifconfig eth0 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}'`
HADOOP_VERSION=${HADOOP_VERSION:-0.20.1}
HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
HADOOP_CONF_DIR=$HADOOP_HOME/conf
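# ROLES is a comma-separated list of the roles this instance should run,
# e.g. "nn,snn,jt" for a master or "dn,tt" for a worker.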
for role in $(echo "$ROLES" | tr "," "\n"); do
case $role in
nn)
NN_HOST=$SELF_HOST
;;
jt)
JT_HOST=$SELF_HOST
;;
esac
done
function register_auto_shutdown() {
if [ ! -z "$AUTO_SHUTDOWN" ]; then
shutdown -h +$AUTO_SHUTDOWN >/dev/null &
fi
}
function update_repo() {
if which dpkg &> /dev/null; then
apt-get update
elif which rpm &> /dev/null; then
yum update -y yum
fi
}
# Install a list of packages on debian or redhat as appropriate
function install_packages() {
if which dpkg &> /dev/null; then
apt-get update
apt-get -y install $@
elif which rpm &> /dev/null; then
yum install -y $@
else
echo "No package manager found."
fi
}
# Install any user packages specified in the USER_PACKAGES environment variable
function install_user_packages() {
if [ ! -z "$USER_PACKAGES" ]; then
install_packages $USER_PACKAGES
fi
}
function install_hadoop() {
useradd hadoop
hadoop_tar_url=http://s3.amazonaws.com/hadoop-releases/core/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
hadoop_tar_file=`basename $hadoop_tar_url`
hadoop_tar_md5_file=`basename $hadoop_tar_url.md5`
curl="curl --retry 3 --silent --show-error --fail"
for i in `seq 1 3`;
do
$curl -O $hadoop_tar_url
$curl -O $hadoop_tar_url.md5
if md5sum -c $hadoop_tar_md5_file; then
break;
else
rm -f $hadoop_tar_file $hadoop_tar_md5_file
fi
done
if [ ! -e $hadoop_tar_file ]; then
echo "Failed to download $hadoop_tar_url. Aborting."
exit 1
fi
tar zxf $hadoop_tar_file -C /usr/local
rm -f $hadoop_tar_file $hadoop_tar_md5_file
echo "export HADOOP_HOME=$HADOOP_HOME" >> ~root/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH' >> ~root/.bashrc
}
function prep_disk() {
mount=$1
device=$2
automount=${3:-false}
echo "warning: ERASING CONTENTS OF $device"
mkfs.xfs -f $device
if [ ! -e $mount ]; then
mkdir $mount
fi
mount -o defaults,noatime $device $mount
if $automount ; then
echo "$device $mount xfs defaults,noatime 0 0" >> /etc/fstab
fi
}
function wait_for_mount {
mount=$1
device=$2
mkdir $mount
i=1
echo "Attempting to mount $device"
while true ; do
sleep 10
echo -n "$i "
i=$[$i+1]
mount -o defaults,noatime $device $mount || continue
echo " Mounted."
break;
done
}
function make_hadoop_dirs {
for mount in "$@"; do
if [ ! -e $mount/hadoop ]; then
mkdir -p $mount/hadoop
chown hadoop:hadoop $mount/hadoop
fi
done
}
# Configure Hadoop by setting up disks and site file
function configure_hadoop() {
MOUNT=/data
FIRST_MOUNT=$MOUNT
DFS_NAME_DIR=$MOUNT/hadoop/hdfs/name
FS_CHECKPOINT_DIR=$MOUNT/hadoop/hdfs/secondary
DFS_DATA_DIR=$MOUNT/hadoop/hdfs/data
MAPRED_LOCAL_DIR=$MOUNT/hadoop/mapred/local
MAX_MAP_TASKS=2
MAX_REDUCE_TASKS=1
CHILD_OPTS=-Xmx550m
CHILD_ULIMIT=1126400
TMP_DIR=$MOUNT/tmp/hadoop-\${user.name}
mkdir -p $MOUNT/hadoop
chown hadoop:hadoop $MOUNT/hadoop
mkdir $MOUNT/tmp
chmod a+rwxt $MOUNT/tmp
mkdir /etc/hadoop
ln -s $HADOOP_CONF_DIR /etc/hadoop/conf
##############################################################################
# Modify this section to customize your Hadoop cluster.
##############################################################################
cat > $HADOOP_CONF_DIR/hadoop-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>$DFS_DATA_DIR</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>3</value>
<final>true</final>
</property>
<!--property>
<name>dfs.hosts</name>
<value>$HADOOP_CONF_DIR/dfs.hosts</value>
<final>true</final>
</property-->
<!--property>
<name>dfs.hosts.exclude</name>
<value>$HADOOP_CONF_DIR/dfs.hosts.exclude</value>
<final>true</final>
</property-->
<property>
<name>dfs.name.dir</name>
<value>$DFS_NAME_DIR</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>5</value>
<final>true</final>
</property>
<property>
<name>dfs.permissions</name>
<value>true</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>$DFS_REPLICATION</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>$FS_CHECKPOINT_DIR</value>
<final>true</final>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://$NN_HOST:8020/</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/data/tmp/hadoop-\${user.name}</value>
<final>true</final>
</property>
<property>
<name>io.file.buffer.size</name>
<value>65536</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>$CHILD_OPTS</value>
</property>
<property>
<name>mapred.child.ulimit</name>
<value>$CHILD_ULIMIT</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>$JT_HOST:8021</value>
</property>
<property>
<name>mapred.job.tracker.handler.count</name>
<value>5</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>$MAPRED_LOCAL_DIR</value>
<final>true</final>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>true</value>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>10</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>10</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.submit.replication</name>
<value>10</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/system/mapred</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>$MAX_MAP_TASKS</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>$MAX_REDUCE_TASKS</value>
<final>true</final>
</property>
<property>
<name>tasktracker.http.threads</name>
<value>46</value>
<final>true</final>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.StandardSocketFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.ClientProtocol</name>
<value></value>
<final>true</final>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.JobSubmissionProtocol</name>
<value></value>
<final>true</final>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
</property>
</configuration>
EOF
# Keep PID files in a non-temporary directory
sed -i -e "s|# export HADOOP_PID_DIR=.*|export HADOOP_PID_DIR=/var/run/hadoop|" \
$HADOOP_CONF_DIR/hadoop-env.sh
mkdir -p /var/run/hadoop
chown -R hadoop:hadoop /var/run/hadoop
# Set SSH options within the cluster
sed -i -e 's|# export HADOOP_SSH_OPTS=.*|export HADOOP_SSH_OPTS="-o StrictHostKeyChecking=no"|' \
$HADOOP_CONF_DIR/hadoop-env.sh
# Disable IPv6
sed -i -e 's|# export HADOOP_OPTS=.*|export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true"|' \
$HADOOP_CONF_DIR/hadoop-env.sh
# Hadoop logs should be on the /data partition
sed -i -e 's|# export HADOOP_LOG_DIR=.*|export HADOOP_LOG_DIR=/var/log/hadoop/logs|' \
$HADOOP_CONF_DIR/hadoop-env.sh
rm -rf /var/log/hadoop
mkdir /data/hadoop/logs
chown hadoop:hadoop /data/hadoop/logs
ln -s /data/hadoop/logs /var/log/hadoop
chown -R hadoop:hadoop /var/log/hadoop
}
# Sets up a small status website on the cluster's namenode.
function setup_web() {
if which dpkg &> /dev/null; then
apt-get -y install thttpd
WWW_BASE=/var/www
elif which rpm &> /dev/null; then
yum install -y thttpd
chkconfig --add thttpd
WWW_BASE=/var/www/thttpd/html
fi
cat > $WWW_BASE/index.html << END
<html>
<head>
<title>Hadoop Cloud Cluster</title>
</head>
<body>
<h1>Hadoop Cloud Cluster</h1>
To browse the cluster you need to have a proxy configured.
Start the proxy with <tt>hadoop-cloud proxy &lt;cluster_name&gt;</tt>,
and point your browser to
<a href="http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac">this Proxy
Auto-Configuration (PAC)</a> file. To manage multiple proxy configurations,
you may wish to use
<a href="https://addons.mozilla.org/en-US/firefox/addon/2464">FoxyProxy</a>.
<ul>
<li><a href="http://$NN_HOST:50070/">NameNode</a>
<li><a href="http://$JT_HOST:50030/">JobTracker</a>
</ul>
</body>
</html>
END
service thttpd start
}
function start_namenode() {
if which dpkg &> /dev/null; then
AS_HADOOP="su -s /bin/bash - hadoop -c"
elif which rpm &> /dev/null; then
AS_HADOOP="/sbin/runuser -s /bin/bash - hadoop -c"
fi
# Format HDFS
[ ! -e $FIRST_MOUNT/hadoop/hdfs ] && $AS_HADOOP "$HADOOP_HOME/bin/hadoop namenode -format"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop-daemon.sh start namenode"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop dfsadmin -safemode wait"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop fs -mkdir /user"
# The following is questionable, as it allows a user to delete another user's
# directory. It's needed to allow users to create their own user directories.
$AS_HADOOP "$HADOOP_HOME/bin/hadoop fs -chmod +w /user"
}
function start_daemon() {
if which dpkg &> /dev/null; then
AS_HADOOP="su -s /bin/bash - hadoop -c"
elif which rpm &> /dev/null; then
AS_HADOOP="/sbin/runuser -s /bin/bash - hadoop -c"
fi
$AS_HADOOP "$HADOOP_HOME/bin/hadoop-daemon.sh start $1"
}
update_repo
register_auto_shutdown
install_user_packages
install_hadoop
configure_hadoop
for role in $(echo "$ROLES" | tr "," "\n"); do
case $role in
nn)
setup_web
start_namenode
;;
snn)
start_daemon secondarynamenode
;;
jt)
start_daemon jobtracker
;;
dn)
start_daemon datanode
;;
tt)
start_daemon tasktracker
;;
esac
done


@ -1,548 +0,0 @@
#!/bin/bash -x
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
# Script that is run on each EC2 instance on boot. It is passed in the EC2 user
# data, so should not exceed 16K in size after gzip compression.
#
# This script is executed by /etc/init.d/ec2-run-user-data, and output is
# logged to /var/log/messages.
################################################################################
################################################################################
# Initialize variables
################################################################################
# Substitute environment variables passed by the client
export %ENV%
HADOOP_VERSION=${HADOOP_VERSION:-0.20.1}
HADOOP_HOME=/usr/local/hadoop-$HADOOP_VERSION
HADOOP_CONF_DIR=$HADOOP_HOME/conf
SELF_HOST=`wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname`
for role in $(echo "$ROLES" | tr "," "\n"); do
case $role in
nn)
NN_HOST=$SELF_HOST
;;
jt)
JT_HOST=$SELF_HOST
;;
esac
done
function register_auto_shutdown() {
if [ ! -z "$AUTO_SHUTDOWN" ]; then
shutdown -h +$AUTO_SHUTDOWN >/dev/null &
fi
}
# Install a list of packages on debian or redhat as appropriate
function install_packages() {
if which dpkg &> /dev/null; then
apt-get update
apt-get -y install $@
elif which rpm &> /dev/null; then
yum install -y $@
else
echo "No package manager found."
fi
}
# Install any user packages specified in the USER_PACKAGES environment variable
function install_user_packages() {
if [ ! -z "$USER_PACKAGES" ]; then
install_packages $USER_PACKAGES
fi
}
function install_hadoop() {
useradd hadoop
hadoop_tar_url=http://s3.amazonaws.com/hadoop-releases/core/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz
hadoop_tar_file=`basename $hadoop_tar_url`
hadoop_tar_md5_file=`basename $hadoop_tar_url.md5`
curl="curl --retry 3 --silent --show-error --fail"
for i in `seq 1 3`;
do
$curl -O $hadoop_tar_url
$curl -O $hadoop_tar_url.md5
if md5sum -c $hadoop_tar_md5_file; then
break;
else
rm -f $hadoop_tar_file $hadoop_tar_md5_file
fi
done
if [ ! -e $hadoop_tar_file ]; then
echo "Failed to download $hadoop_tar_url. Aborting."
exit 1
fi
tar zxf $hadoop_tar_file -C /usr/local
rm -f $hadoop_tar_file $hadoop_tar_md5_file
echo "export HADOOP_HOME=$HADOOP_HOME" >> ~root/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH' >> ~root/.bashrc
}
function prep_disk() {
mount=$1
device=$2
automount=${3:-false}
echo "warning: ERASING CONTENTS OF $device"
mkfs.xfs -f $device
if [ ! -e $mount ]; then
mkdir $mount
fi
mount -o defaults,noatime $device $mount
if $automount ; then
echo "$device $mount xfs defaults,noatime 0 0" >> /etc/fstab
fi
}
function wait_for_mount {
mount=$1
device=$2
mkdir $mount
i=1
echo "Attempting to mount $device"
while true ; do
sleep 10
echo -n "$i "
i=$[$i+1]
mount -o defaults,noatime $device $mount || continue
echo " Mounted."
break;
done
}
function make_hadoop_dirs {
for mount in "$@"; do
if [ ! -e $mount/hadoop ]; then
mkdir -p $mount/hadoop
chown hadoop:hadoop $mount/hadoop
fi
done
}
# Configure Hadoop by setting up disks and site file
function configure_hadoop() {
install_packages xfsprogs # needed for XFS
INSTANCE_TYPE=`wget -q -O - http://169.254.169.254/latest/meta-data/instance-type`
if [ -n "$EBS_MAPPINGS" ]; then
# EBS_MAPPINGS is like "/ebs1,/dev/sdj;/ebs2,/dev/sdk"
DFS_NAME_DIR=''
FS_CHECKPOINT_DIR=''
DFS_DATA_DIR=''
for mapping in $(echo "$EBS_MAPPINGS" | tr ";" "\n"); do
# Split on the comma (see "Parameter Expansion" in the bash man page)
mount=${mapping%,*}
device=${mapping#*,}
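# e.g. for mapping="/ebs1,/dev/sdj" this yields mount=/ebs1 and device=/dev/sdj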
wait_for_mount $mount $device
DFS_NAME_DIR=${DFS_NAME_DIR},"$mount/hadoop/hdfs/name"
FS_CHECKPOINT_DIR=${FS_CHECKPOINT_DIR},"$mount/hadoop/hdfs/secondary"
DFS_DATA_DIR=${DFS_DATA_DIR},"$mount/hadoop/hdfs/data"
FIRST_MOUNT=${FIRST_MOUNT-$mount}
make_hadoop_dirs $mount
done
# Remove leading commas
DFS_NAME_DIR=${DFS_NAME_DIR#?}
FS_CHECKPOINT_DIR=${FS_CHECKPOINT_DIR#?}
DFS_DATA_DIR=${DFS_DATA_DIR#?}
DFS_REPLICATION=3 # EBS is internally replicated, but we also use HDFS replication for safety
else
case $INSTANCE_TYPE in
m1.xlarge|c1.xlarge)
DFS_NAME_DIR=/mnt/hadoop/hdfs/name,/mnt2/hadoop/hdfs/name
FS_CHECKPOINT_DIR=/mnt/hadoop/hdfs/secondary,/mnt2/hadoop/hdfs/secondary
DFS_DATA_DIR=/mnt/hadoop/hdfs/data,/mnt2/hadoop/hdfs/data,/mnt3/hadoop/hdfs/data,/mnt4/hadoop/hdfs/data
;;
m1.large)
DFS_NAME_DIR=/mnt/hadoop/hdfs/name,/mnt2/hadoop/hdfs/name
FS_CHECKPOINT_DIR=/mnt/hadoop/hdfs/secondary,/mnt2/hadoop/hdfs/secondary
DFS_DATA_DIR=/mnt/hadoop/hdfs/data,/mnt2/hadoop/hdfs/data
;;
*)
# "m1.small" or "c1.medium"
DFS_NAME_DIR=/mnt/hadoop/hdfs/name
FS_CHECKPOINT_DIR=/mnt/hadoop/hdfs/secondary
DFS_DATA_DIR=/mnt/hadoop/hdfs/data
;;
esac
FIRST_MOUNT=/mnt
DFS_REPLICATION=3
fi
case $INSTANCE_TYPE in
m1.xlarge|c1.xlarge)
prep_disk /mnt2 /dev/sdc true &
disk2_pid=$!
prep_disk /mnt3 /dev/sdd true &
disk3_pid=$!
prep_disk /mnt4 /dev/sde true &
disk4_pid=$!
wait $disk2_pid $disk3_pid $disk4_pid
MAPRED_LOCAL_DIR=/mnt/hadoop/mapred/local,/mnt2/hadoop/mapred/local,/mnt3/hadoop/mapred/local,/mnt4/hadoop/mapred/local
MAX_MAP_TASKS=8
MAX_REDUCE_TASKS=4
CHILD_OPTS=-Xmx680m
CHILD_ULIMIT=1392640
;;
m1.large)
prep_disk /mnt2 /dev/sdc true
MAPRED_LOCAL_DIR=/mnt/hadoop/mapred/local,/mnt2/hadoop/mapred/local
MAX_MAP_TASKS=4
MAX_REDUCE_TASKS=2
CHILD_OPTS=-Xmx1024m
CHILD_ULIMIT=2097152
;;
c1.medium)
MAPRED_LOCAL_DIR=/mnt/hadoop/mapred/local
MAX_MAP_TASKS=4
MAX_REDUCE_TASKS=2
CHILD_OPTS=-Xmx550m
CHILD_ULIMIT=1126400
;;
*)
# "m1.small"
MAPRED_LOCAL_DIR=/mnt/hadoop/mapred/local
MAX_MAP_TASKS=2
MAX_REDUCE_TASKS=1
CHILD_OPTS=-Xmx550m
CHILD_ULIMIT=1126400
;;
esac
make_hadoop_dirs `ls -d /mnt*`
# Create tmp directory
mkdir /mnt/tmp
chmod a+rwxt /mnt/tmp
mkdir /etc/hadoop
ln -s $HADOOP_CONF_DIR /etc/hadoop/conf
##############################################################################
# Modify this section to customize your Hadoop cluster.
##############################################################################
cat > $HADOOP_CONF_DIR/hadoop-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>dfs.block.size</name>
<value>134217728</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>$DFS_DATA_DIR</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
<final>true</final>
</property>
<property>
<name>dfs.datanode.handler.count</name>
<value>3</value>
<final>true</final>
</property>
<!--property>
<name>dfs.hosts</name>
<value>$HADOOP_CONF_DIR/dfs.hosts</value>
<final>true</final>
</property-->
<!--property>
<name>dfs.hosts.exclude</name>
<value>$HADOOP_CONF_DIR/dfs.hosts.exclude</value>
<final>true</final>
</property-->
<property>
<name>dfs.name.dir</name>
<value>$DFS_NAME_DIR</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.handler.count</name>
<value>5</value>
<final>true</final>
</property>
<property>
<name>dfs.permissions</name>
<value>true</value>
<final>true</final>
</property>
<property>
<name>dfs.replication</name>
<value>$DFS_REPLICATION</value>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>$FS_CHECKPOINT_DIR</value>
<final>true</final>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://$NN_HOST:8020/</value>
</property>
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<final>true</final>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/mnt/tmp/hadoop-\${user.name}</value>
<final>true</final>
</property>
<property>
<name>io.file.buffer.size</name>
<value>65536</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>$CHILD_OPTS</value>
</property>
<property>
<name>mapred.child.ulimit</name>
<value>$CHILD_ULIMIT</value>
<final>true</final>
</property>
<property>
<name>mapred.job.tracker</name>
<value>$JT_HOST:8021</value>
</property>
<property>
<name>mapred.job.tracker.handler.count</name>
<value>5</value>
<final>true</final>
</property>
<property>
<name>mapred.local.dir</name>
<value>$MAPRED_LOCAL_DIR</value>
<final>true</final>
</property>
<property>
<name>mapred.map.tasks.speculative.execution</name>
<value>true</value>
</property>
<property>
<name>mapred.reduce.parallel.copies</name>
<value>10</value>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>10</value>
</property>
<property>
<name>mapred.reduce.tasks.speculative.execution</name>
<value>false</value>
</property>
<property>
<name>mapred.submit.replication</name>
<value>10</value>
</property>
<property>
<name>mapred.system.dir</name>
<value>/hadoop/system/mapred</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>$MAX_MAP_TASKS</value>
<final>true</final>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>$MAX_REDUCE_TASKS</value>
<final>true</final>
</property>
<property>
<name>tasktracker.http.threads</name>
<value>46</value>
<final>true</final>
</property>
<property>
<name>mapred.compress.map.output</name>
<value>true</value>
</property>
<property>
<name>mapred.output.compression.type</name>
<value>BLOCK</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.StandardSocketFactory</value>
<final>true</final>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.ClientProtocol</name>
<value></value>
<final>true</final>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.JobSubmissionProtocol</name>
<value></value>
<final>true</final>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>$AWS_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>$AWS_SECRET_ACCESS_KEY</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>$AWS_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>$AWS_SECRET_ACCESS_KEY</value>
</property>
</configuration>
EOF
# Keep PID files in a non-temporary directory
sed -i -e "s|# export HADOOP_PID_DIR=.*|export HADOOP_PID_DIR=/var/run/hadoop|" \
$HADOOP_CONF_DIR/hadoop-env.sh
mkdir -p /var/run/hadoop
chown -R hadoop:hadoop /var/run/hadoop
# Set SSH options within the cluster
sed -i -e 's|# export HADOOP_SSH_OPTS=.*|export HADOOP_SSH_OPTS="-o StrictHostKeyChecking=no"|' \
$HADOOP_CONF_DIR/hadoop-env.sh
# Hadoop logs should be on the /mnt partition
sed -i -e 's|# export HADOOP_LOG_DIR=.*|export HADOOP_LOG_DIR=/var/log/hadoop/logs|' \
$HADOOP_CONF_DIR/hadoop-env.sh
rm -rf /var/log/hadoop
mkdir /mnt/hadoop/logs
chown hadoop:hadoop /mnt/hadoop/logs
ln -s /mnt/hadoop/logs /var/log/hadoop
chown -R hadoop:hadoop /var/log/hadoop
}
# Sets up a small status website on the cluster's namenode.
function setup_web() {
if which dpkg &> /dev/null; then
apt-get -y install thttpd
WWW_BASE=/var/www
elif which rpm &> /dev/null; then
yum install -y thttpd
chkconfig --add thttpd
WWW_BASE=/var/www/thttpd/html
fi
cat > $WWW_BASE/index.html << END
<html>
<head>
<title>Hadoop EC2 Cluster</title>
</head>
<body>
<h1>Hadoop EC2 Cluster</h1>
To browse the cluster you need to have a proxy configured.
Start the proxy with <tt>hadoop-ec2 proxy &lt;cluster_name&gt;</tt>,
and point your browser to
<a href="http://apache-hadoop-ec2.s3.amazonaws.com/proxy.pac">this Proxy
Auto-Configuration (PAC)</a> file. To manage multiple proxy configurations,
you may wish to use
<a href="https://addons.mozilla.org/en-US/firefox/addon/2464">FoxyProxy</a>.
<ul>
<li><a href="http://$NN_HOST:50070/">NameNode</a>
<li><a href="http://$JT_HOST:50030/">JobTracker</a>
</ul>
</body>
</html>
END
service thttpd start
}
function start_namenode() {
if which dpkg &> /dev/null; then
AS_HADOOP="su -s /bin/bash - hadoop -c"
elif which rpm &> /dev/null; then
AS_HADOOP="/sbin/runuser -s /bin/bash - hadoop -c"
fi
# Format HDFS
[ ! -e $FIRST_MOUNT/hadoop/hdfs ] && $AS_HADOOP "$HADOOP_HOME/bin/hadoop namenode -format"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop-daemon.sh start namenode"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop dfsadmin -safemode wait"
$AS_HADOOP "$HADOOP_HOME/bin/hadoop fs -mkdir /user"
# The following is questionable, as it allows a user to delete another user's
# directory. It's needed to allow users to create their own user directories.
$AS_HADOOP "$HADOOP_HOME/bin/hadoop fs -chmod +w /user"
}
function start_daemon() {
if which dpkg &> /dev/null; then
AS_HADOOP="su -s /bin/bash - hadoop -c"
elif which rpm &> /dev/null; then
AS_HADOOP="/sbin/runuser -s /bin/bash - hadoop -c"
fi
$AS_HADOOP "$HADOOP_HOME/bin/hadoop-daemon.sh start $1"
}
register_auto_shutdown
install_user_packages
install_hadoop
configure_hadoop
for role in $(echo "$ROLES" | tr "," "\n"); do
case $role in
nn)
setup_web
start_namenode
;;
snn)
start_daemon secondarynamenode
;;
jt)
start_daemon jobtracker
;;
dn)
start_daemon datanode
;;
tt)
start_daemon tasktracker
;;
esac
done


@ -1,22 +0,0 @@
#!/bin/bash -ex
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Run a script downloaded at boot time to avoid Rackspace's 10K limitation.
wget -qO/usr/bin/runurl run.alestic.com/runurl
chmod 755 /usr/bin/runurl
%ENV% runurl http://hadoop-dev-test.s3.amazonaws.com/boot-rackspace.sh


@ -1,112 +0,0 @@
#!/bin/bash -x
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
################################################################################
# Script that is run on each EC2 instance on boot. It is passed in the EC2 user
# data, so should not exceed 16K in size after gzip compression.
#
# This script is executed by /etc/init.d/ec2-run-user-data, and output is
# logged to /var/log/messages.
################################################################################
################################################################################
# Initialize variables
################################################################################
# Substitute environment variables passed by the client
export %ENV%
ZK_VERSION=${ZK_VERSION:-3.2.2}
ZOOKEEPER_HOME=/usr/local/zookeeper-$ZK_VERSION
ZK_CONF_DIR=/etc/zookeeper/conf
function register_auto_shutdown() {
if [ ! -z "$AUTO_SHUTDOWN" ]; then
shutdown -h +$AUTO_SHUTDOWN >/dev/null &
fi
}
# Install a list of packages on debian or redhat as appropriate
function install_packages() {
if which dpkg &> /dev/null; then
apt-get update
apt-get -y install $@
elif which rpm &> /dev/null; then
yum install -y $@
else
echo "No package manager found."
fi
}
# Install any user packages specified in the USER_PACKAGES environment variable
function install_user_packages() {
if [ ! -z "$USER_PACKAGES" ]; then
install_packages $USER_PACKAGES
fi
}
function install_zookeeper() {
zk_tar_url=http://www.apache.org/dist/hadoop/zookeeper/zookeeper-$ZK_VERSION/zookeeper-$ZK_VERSION.tar.gz
zk_tar_file=`basename $zk_tar_url`
zk_tar_md5_file=`basename $zk_tar_url.md5`
curl="curl --retry 3 --silent --show-error --fail"
for i in `seq 1 3`;
do
$curl -O $zk_tar_url
$curl -O $zk_tar_url.md5
if md5sum -c $zk_tar_md5_file; then
break;
else
rm -f $zk_tar_file $zk_tar_md5_file
fi
done
if [ ! -e $zk_tar_file ]; then
echo "Failed to download $zk_tar_url. Aborting."
exit 1
fi
tar zxf $zk_tar_file -C /usr/local
rm -f $zk_tar_file $zk_tar_md5_file
echo "export ZOOKEEPER_HOME=$ZOOKEEPER_HOME" >> ~root/.bashrc
echo 'export PATH=$JAVA_HOME/bin:$ZOOKEEPER_HOME/bin:$PATH' >> ~root/.bashrc
}
function configure_zookeeper() {
mkdir -p /mnt/zookeeper/logs
ln -s /mnt/zookeeper/logs /var/log/zookeeper
mkdir -p /var/log/zookeeper/txlog
mkdir -p $ZK_CONF_DIR
cp $ZOOKEEPER_HOME/conf/log4j.properties $ZK_CONF_DIR
sed -i -e "s|log4j.rootLogger=INFO, CONSOLE|log4j.rootLogger=INFO, ROLLINGFILE|" \
-e "s|log4j.appender.ROLLINGFILE.File=zookeeper.log|log4j.appender.ROLLINGFILE.File=/var/log/zookeeper/zookeeper.log|" \
$ZK_CONF_DIR/log4j.properties
# Ensure ZooKeeper starts on boot
cat > /etc/rc.local <<EOF
ZOOCFGDIR=$ZK_CONF_DIR $ZOOKEEPER_HOME/bin/zkServer.sh start > /dev/null 2>&1 &
EOF
}
register_auto_shutdown
install_user_packages
install_zookeeper
configure_zookeeper


@ -1,14 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


@ -1,61 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from hadoop.cloud.cluster import Cluster
from hadoop.cloud.cluster import Instance
logger = logging.getLogger(__name__)
class DummyCluster(Cluster):
@staticmethod
def get_clusters_with_role(role, state="running"):
logger.info("get_clusters_with_role(%s, %s)", role, state)
return ["dummy-cluster"]
def __init__(self, name, config_dir):
super(DummyCluster, self).__init__(name, config_dir)
logger.info("__init__(%s, %s)", name, config_dir)
def get_provider_code(self):
return "dummy"
def authorize_role(self, role, from_port, to_port, cidr_ip):
logger.info("authorize_role(%s, %s, %s, %s)", role, from_port, to_port,
cidr_ip)
def get_instances_in_role(self, role, state_filter=None):
logger.info("get_instances_in_role(%s, %s)", role, state_filter)
return [Instance(1, '127.0.0.1', '127.0.0.1')]
def print_status(self, roles, state_filter="running"):
logger.info("print_status(%s, %s)", roles, state_filter)
def launch_instances(self, role, number, image_id, size_id,
instance_user_data, **kwargs):
logger.info("launch_instances(%s, %s, %s, %s, %s, %s)", role, number,
image_id, size_id, instance_user_data, str(kwargs))
return [1]
def wait_for_instances(self, instance_ids, timeout=600):
logger.info("wait_for_instances(%s, %s)", instance_ids, timeout)
def terminate(self):
logger.info("terminate")
def delete(self):
logger.info("delete")


@ -1,479 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from boto.ec2.connection import EC2Connection
from boto.exception import EC2ResponseError
import logging
from hadoop.cloud.cluster import Cluster
from hadoop.cloud.cluster import Instance
from hadoop.cloud.cluster import RoleSyntaxException
from hadoop.cloud.cluster import TimeoutException
from hadoop.cloud.storage import JsonVolumeManager
from hadoop.cloud.storage import JsonVolumeSpecManager
from hadoop.cloud.storage import MountableVolume
from hadoop.cloud.storage import Storage
from hadoop.cloud.util import xstr
import os
import re
import subprocess
import sys
import time
logger = logging.getLogger(__name__)
def _run_command_on_instance(instance, ssh_options, command):
print "Running ssh %s root@%s '%s'" % \
(ssh_options, instance.public_dns_name, command)
retcode = subprocess.call("ssh %s root@%s '%s'" %
(ssh_options, instance.public_dns_name, command),
shell=True)
print "Command running on %s returned with value %s" % \
(instance.public_dns_name, retcode)
def _wait_for_volume(ec2_connection, volume_id):
"""
Waits until a volume becomes available.
"""
while True:
volumes = ec2_connection.get_all_volumes([volume_id,])
if volumes[0].status == 'available':
break
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
class Ec2Cluster(Cluster):
"""
A cluster of EC2 instances. A cluster has a unique name.
Instances running in the cluster run in a security group with the cluster's
name, and also a name indicating the instance's role, e.g. <cluster-name>-foo
to show a "foo" instance.
"""
@staticmethod
def get_clusters_with_role(role, state="running"):
all_instances = EC2Connection().get_all_instances()
clusters = []
for res in all_instances:
instance = res.instances[0]
for group in res.groups:
if group.id.endswith("-" + role) and instance.state == state:
clusters.append(re.sub("-%s$" % re.escape(role), "", group.id))
return clusters
def __init__(self, name, config_dir):
super(Ec2Cluster, self).__init__(name, config_dir)
self.ec2Connection = EC2Connection()
def get_provider_code(self):
return "ec2"
def _get_cluster_group_name(self):
return self.name
def _check_role_name(self, role):
if not re.match("^[a-zA-Z0-9_+]+$", role):
raise RoleSyntaxException("Invalid role name '%s'" % role)
def _group_name_for_role(self, role):
"""
Return the security group name for an instance in a given role.
"""
self._check_role_name(role)
return "%s-%s" % (self.name, role)
def _get_group_names(self, roles):
group_names = [self._get_cluster_group_name()]
for role in roles:
group_names.append(self._group_name_for_role(role))
return group_names
def _get_all_group_names(self):
security_groups = self.ec2Connection.get_all_security_groups()
security_group_names = \
[security_group.name for security_group in security_groups]
return security_group_names
def _get_all_group_names_for_cluster(self):
all_group_names = self._get_all_group_names()
r = []
if self.name not in all_group_names:
return r
for group in all_group_names:
if re.match("^%s(-[a-zA-Z0-9_+]+)?$" % self.name, group):
r.append(group)
return r
def _create_groups(self, role):
"""
Create the security groups for a given role, including a group for the
cluster if it doesn't exist.
"""
self._check_role_name(role)
security_group_names = self._get_all_group_names()
cluster_group_name = self._get_cluster_group_name()
if not cluster_group_name in security_group_names:
self.ec2Connection.create_security_group(cluster_group_name,
"Cluster (%s)" % (self.name))
self.ec2Connection.authorize_security_group(cluster_group_name,
cluster_group_name)
# Allow SSH from anywhere
self.ec2Connection.authorize_security_group(cluster_group_name,
ip_protocol="tcp",
from_port=22, to_port=22,
cidr_ip="0.0.0.0/0")
role_group_name = self._group_name_for_role(role)
if not role_group_name in security_group_names:
self.ec2Connection.create_security_group(role_group_name,
"Role %s (%s)" % (role, self.name))
def authorize_role(self, role, from_port, to_port, cidr_ip):
"""
Authorize access to machines in a given role from a given network.
"""
self._check_role_name(role)
role_group_name = self._group_name_for_role(role)
# Revoke first to avoid InvalidPermission.Duplicate error
self.ec2Connection.revoke_security_group(role_group_name,
ip_protocol="tcp",
from_port=from_port,
to_port=to_port, cidr_ip=cidr_ip)
self.ec2Connection.authorize_security_group(role_group_name,
ip_protocol="tcp",
from_port=from_port,
to_port=to_port,
cidr_ip=cidr_ip)
def _get_instances(self, group_name, state_filter=None):
"""
Get all the instances in a group, filtered by state.
@param group_name: the name of the group
@param state_filter: the state that the instance should be in
(e.g. "running"), or None for all states
"""
all_instances = self.ec2Connection.get_all_instances()
instances = []
for res in all_instances:
for group in res.groups:
if group.id == group_name:
for instance in res.instances:
if state_filter == None or instance.state == state_filter:
instances.append(instance)
return instances
def get_instances_in_role(self, role, state_filter=None):
"""
Get all the instances in a role, filtered by state.
@param role: the name of the role
@param state_filter: the state that the instance should be in
(e.g. "running"), or None for all states
"""
self._check_role_name(role)
instances = []
for instance in self._get_instances(self._group_name_for_role(role),
state_filter):
instances.append(Instance(instance.id, instance.dns_name,
instance.private_dns_name))
return instances
def _print_instance(self, role, instance):
print "\t".join((role, instance.id,
instance.image_id,
instance.dns_name, instance.private_dns_name,
instance.state, xstr(instance.key_name), instance.instance_type,
str(instance.launch_time), instance.placement))
def print_status(self, roles=None, state_filter="running"):
"""
Print the status of instances in the given roles, filtered by state.
"""
if not roles:
for instance in self._get_instances(self._get_cluster_group_name(),
state_filter):
self._print_instance("", instance)
else:
for role in roles:
for instance in self._get_instances(self._group_name_for_role(role),
state_filter):
self._print_instance(role, instance)
def launch_instances(self, roles, number, image_id, size_id,
instance_user_data, **kwargs):
for role in roles:
self._check_role_name(role)
self._create_groups(role)
user_data = instance_user_data.read_as_gzip_stream()
security_groups = self._get_group_names(roles) + kwargs.get('security_groups', [])
reservation = self.ec2Connection.run_instances(image_id, min_count=number,
max_count=number, key_name=kwargs.get('key_name', None),
security_groups=security_groups, user_data=user_data,
instance_type=size_id, placement=kwargs.get('placement', None))
return [instance.id for instance in reservation.instances]
def wait_for_instances(self, instance_ids, timeout=600):
start_time = time.time()
while True:
if (time.time() - start_time >= timeout):
raise TimeoutException()
try:
if self._all_started(self.ec2Connection.get_all_instances(instance_ids)):
break
# Don't fail on the race condition where a newly launched instance is not yet registered with the API
except EC2ResponseError:
pass
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
def _all_started(self, reservations):
for res in reservations:
for instance in res.instances:
if instance.state != "running":
return False
return True
def terminate(self):
instances = self._get_instances(self._get_cluster_group_name(), "running")
if instances:
self.ec2Connection.terminate_instances([i.id for i in instances])
def delete(self):
"""
Delete the security groups for each role in the cluster, and the group for
the cluster.
"""
group_names = self._get_all_group_names_for_cluster()
for group in group_names:
self.ec2Connection.delete_security_group(group)
def get_storage(self):
"""
Return the external storage for the cluster.
"""
return Ec2Storage(self)
class Ec2Storage(Storage):
"""
Storage volumes for an EC2 cluster. The storage is associated with a named
cluster. Metadata for the storage volumes is kept in a JSON file on the client
machine (in a file called "ec2-storage-<cluster-name>.json" in the
configuration directory).
"""
@staticmethod
def create_formatted_snapshot(cluster, size, availability_zone, image_id,
key_name, ssh_options):
"""
Creates a formatted snapshot of a given size. This saves having to format
volumes when they are first attached.
"""
conn = cluster.ec2Connection
print "Starting instance"
reservation = conn.run_instances(image_id, key_name=key_name,
placement=availability_zone)
instance = reservation.instances[0]
try:
cluster.wait_for_instances([instance.id,])
print "Started instance %s" % instance.id
except TimeoutException:
print "Timeout"
return
print
print "Waiting 60 seconds before attaching storage"
time.sleep(60)
# Re-populate instance object since it has more details filled in
instance.update()
print "Creating volume of size %s in %s" % (size, availability_zone)
volume = conn.create_volume(size, availability_zone)
print "Created volume %s" % volume
print "Attaching volume to %s" % instance.id
volume.attach(instance.id, '/dev/sdj')
_run_command_on_instance(instance, ssh_options, """
while true ; do
echo 'Waiting for /dev/sdj...';
if [ -e /dev/sdj ]; then break; fi;
sleep 1;
done;
mkfs.ext3 -F -m 0.5 /dev/sdj
""")
print "Detaching volume"
conn.detach_volume(volume.id, instance.id)
print "Creating snapshot"
snapshot = volume.create_snapshot()
print "Created snapshot %s" % snapshot.id
_wait_for_volume(conn, volume.id)
print
print "Deleting volume"
volume.delete()
print "Deleted volume"
print "Stopping instance"
terminated = conn.terminate_instances([instance.id,])
print "Stopped instance %s" % terminated
def __init__(self, cluster):
super(Ec2Storage, self).__init__(cluster)
self.config_dir = cluster.config_dir
def _get_storage_filename(self):
return os.path.join(self.config_dir,
"ec2-storage-%s.json" % (self.cluster.name))
def create(self, role, number_of_instances, availability_zone, spec_filename):
spec_file = open(spec_filename, 'r')
volume_spec_manager = JsonVolumeSpecManager(spec_file)
volume_manager = JsonVolumeManager(self._get_storage_filename())
for dummy in range(number_of_instances):
mountable_volumes = []
volume_specs = volume_spec_manager.volume_specs_for_role(role)
for spec in volume_specs:
logger.info("Creating volume of size %s in %s from snapshot %s" % \
(spec.size, availability_zone, spec.snapshot_id))
volume = self.cluster.ec2Connection.create_volume(spec.size,
availability_zone,
spec.snapshot_id)
mountable_volumes.append(MountableVolume(volume.id, spec.mount_point,
spec.device))
volume_manager.add_instance_storage_for_role(role, mountable_volumes)
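# Illustrative only (not part of the original module): the spec file passed to
# the create-storage command is a JSON document keyed by role. The exact field
# names are defined by JsonVolumeSpecManager in hadoop.cloud.storage (not shown
# here); the keys below are assumptions based on the attributes used above
# (spec.size, spec.snapshot_id, spec.mount_point, spec.device):
#
#   {"dn": [{"size": 100, "snapshot_id": "snap-xxxxxxxx",
#            "mount_point": "/ebs1", "device": "/dev/sdj"}]}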
def _get_mountable_volumes(self, role):
storage_filename = self._get_storage_filename()
volume_manager = JsonVolumeManager(storage_filename)
return volume_manager.get_instance_storage_for_role(role)
def get_mappings_string_for_role(self, role):
mappings = {}
mountable_volumes_list = self._get_mountable_volumes(role)
for mountable_volumes in mountable_volumes_list:
for mountable_volume in mountable_volumes:
mappings[mountable_volume.mount_point] = mountable_volume.device
return ";".join(["%s,%s" % (mount_point, device) for (mount_point, device)
in mappings.items()])
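# The resulting string has the form "/ebs1,/dev/sdj;/ebs2,/dev/sdk", matching
# the EBS_MAPPINGS format expected by the EC2 boot script above.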
def _has_storage(self, role):
return self._get_mountable_volumes(role)
def has_any_storage(self, roles):
for role in roles:
if self._has_storage(role):
return True
return False
def get_roles(self):
storage_filename = self._get_storage_filename()
volume_manager = JsonVolumeManager(storage_filename)
return volume_manager.get_roles()
def _get_ec2_volumes_dict(self, mountable_volumes):
volume_ids = [mv.volume_id for mv in sum(mountable_volumes, [])]
volumes = self.cluster.ec2Connection.get_all_volumes(volume_ids)
volumes_dict = {}
for volume in volumes:
volumes_dict[volume.id] = volume
return volumes_dict
def _print_volume(self, role, volume):
print "\t".join((role, volume.id, str(volume.size),
volume.snapshot_id, volume.availabilityZone,
volume.status, str(volume.create_time),
str(volume.attach_time)))
def print_status(self, roles=None):
if roles == None:
storage_filename = self._get_storage_filename()
volume_manager = JsonVolumeManager(storage_filename)
roles = volume_manager.get_roles()
for role in roles:
mountable_volumes_list = self._get_mountable_volumes(role)
ec2_volumes = self._get_ec2_volumes_dict(mountable_volumes_list)
for mountable_volumes in mountable_volumes_list:
for mountable_volume in mountable_volumes:
self._print_volume(role, ec2_volumes[mountable_volume.volume_id])
def _replace(self, string, replacements):
for (match, replacement) in replacements.iteritems():
string = string.replace(match, replacement)
return string
def attach(self, role, instances):
mountable_volumes_list = self._get_mountable_volumes(role)
if not mountable_volumes_list:
return
ec2_volumes = self._get_ec2_volumes_dict(mountable_volumes_list)
available_mountable_volumes_list = []
available_instances_dict = {}
for instance in instances:
available_instances_dict[instance.id] = instance
# Iterate over mountable_volumes and retain those that are not attached
# Also maintain a list of instances that have no attached storage
# Note that we do not fill in "holes" (instances that only have some of
# their storage attached)
for mountable_volumes in mountable_volumes_list:
available = True
for mountable_volume in mountable_volumes:
if ec2_volumes[mountable_volume.volume_id].status != 'available':
available = False
attach_data = ec2_volumes[mountable_volume.volume_id].attach_data
instance_id = attach_data.instance_id
if available_instances_dict.has_key(instance_id):
del available_instances_dict[instance_id]
if available:
available_mountable_volumes_list.append(mountable_volumes)
if len(available_instances_dict) != len(available_mountable_volumes_list):
logger.warning("Number of available instances (%s) and volumes (%s) \
do not match." \
% (len(available_instances_dict),
len(available_mountable_volumes_list)))
for (instance, mountable_volumes) in zip(available_instances_dict.values(),
available_mountable_volumes_list):
print "Attaching storage to %s" % instance.id
for mountable_volume in mountable_volumes:
volume = ec2_volumes[mountable_volume.volume_id]
print "Attaching %s to %s" % (volume.id, instance.id)
volume.attach(instance.id, mountable_volume.device)
def delete(self, roles=[]):
storage_filename = self._get_storage_filename()
volume_manager = JsonVolumeManager(storage_filename)
for role in roles:
mountable_volumes_list = volume_manager.get_instance_storage_for_role(role)
ec2_volumes = self._get_ec2_volumes_dict(mountable_volumes_list)
all_available = True
for volume in ec2_volumes.itervalues():
if volume.status != 'available':
all_available = False
logger.warning("Volume %s is not available.", volume)
if not all_available:
logger.warning("Some volumes are still in use for role %s.\
Aborting delete.", role)
return
for volume in ec2_volumes.itervalues():
volume.delete()
volume_manager.remove_instance_storage_for_role(role)


@ -1,239 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import with_statement
import base64
import os
import subprocess
import sys
import time
import uuid
from hadoop.cloud.cluster import Cluster
from hadoop.cloud.cluster import Instance
from hadoop.cloud.cluster import TimeoutException
from hadoop.cloud.service import HadoopService
from hadoop.cloud.service import TASKTRACKER
from libcloud.drivers.rackspace import RackspaceNodeDriver
from libcloud.base import Node
from libcloud.base import NodeImage
RACKSPACE_KEY = os.environ['RACKSPACE_KEY']
RACKSPACE_SECRET = os.environ['RACKSPACE_SECRET']
STATE_MAP = { 'running': 'ACTIVE' }
STATE_MAP_REVERSED = dict((v, k) for k, v in STATE_MAP.iteritems())
USER_DATA_FILENAME = "/etc/init.d/rackspace-init.sh"
class RackspaceCluster(Cluster):
"""
A cluster of instances running on Rackspace Cloud Servers. A cluster has a
unique name, which is stored under the "cluster" metadata key of each server.
Every instance in the cluster has one or more roles, stored as a
comma-separated string under the "roles" metadata key. For example, an instance
with roles "foo" and "bar" has a "roles" value of "foo,bar".
At boot time two files are injected into an instance's filesystem: the user
data file (which is used as a boot script), and the user's public key.
"""
@staticmethod
def get_clusters_with_role(role, state="running", driver=None):
driver = driver or RackspaceNodeDriver(RACKSPACE_KEY, RACKSPACE_SECRET)
all_nodes = RackspaceCluster._list_nodes(driver)
clusters = set()
for node in all_nodes:
try:
if node.extra['metadata'].has_key('cluster') and \
role in node.extra['metadata']['roles'].split(','):
if node.state == STATE_MAP[state]:
clusters.add(node.extra['metadata']['cluster'])
except KeyError:
pass
return clusters
@staticmethod
def _list_nodes(driver, retries=5):
attempts = 0
while True:
try:
return driver.list_nodes()
except IOError:
attempts = attempts + 1
if attempts > retries:
raise
time.sleep(5)
def __init__(self, name, config_dir, driver=None):
super(RackspaceCluster, self).__init__(name, config_dir)
self.driver = driver or RackspaceNodeDriver(RACKSPACE_KEY, RACKSPACE_SECRET)
def get_provider_code(self):
return "rackspace"
def _get_nodes(self, state_filter=None):
all_nodes = RackspaceCluster._list_nodes(self.driver)
nodes = []
for node in all_nodes:
try:
if node.extra['metadata']['cluster'] == self.name:
if state_filter == None or node.state == STATE_MAP[state_filter]:
nodes.append(node)
except KeyError:
pass
return nodes
def _to_instance(self, node):
return Instance(node.id, node.public_ip[0], node.private_ip[0])
def _get_nodes_in_role(self, role, state_filter=None):
all_nodes = RackspaceCluster._list_nodes(self.driver)
nodes = []
for node in all_nodes:
try:
if node.extra['metadata']['cluster'] == self.name and \
role in node.extra['metadata']['roles'].split(','):
if state_filter == None or node.state == STATE_MAP[state_filter]:
nodes.append(node)
except KeyError:
pass
return nodes
def get_instances_in_role(self, role, state_filter=None):
"""
Get all the instances in a role, filtered by state.
@param role: the name of the role
@param state_filter: the state that the instance should be in
(e.g. "running"), or None for all states
"""
return [self._to_instance(node) for node in \
self._get_nodes_in_role(role, state_filter)]
def _print_node(self, node, out):
out.write("\t".join((node.extra['metadata']['roles'], node.id,
node.name,
self._ip_list_to_string(node.public_ip),
self._ip_list_to_string(node.private_ip),
STATE_MAP_REVERSED[node.state])))
out.write("\n")
def _ip_list_to_string(self, ips):
if ips is None:
return ""
return ",".join(ips)
def print_status(self, roles=None, state_filter="running", out=sys.stdout):
if not roles:
for node in self._get_nodes(state_filter):
self._print_node(node, out)
else:
for role in roles:
for node in self._get_nodes_in_role(role, state_filter):
self._print_node(node, out)
def launch_instances(self, roles, number, image_id, size_id,
instance_user_data, **kwargs):
metadata = {"cluster": self.name, "roles": ",".join(roles)}
node_ids = []
files = { USER_DATA_FILENAME: instance_user_data.read() }
if "public_key" in kwargs:
files["/root/.ssh/authorized_keys"] = open(kwargs["public_key"]).read()
for dummy in range(number):
node = self._launch_instance(roles, image_id, size_id, metadata, files)
node_ids.append(node.id)
return node_ids
def _launch_instance(self, roles, image_id, size_id, metadata, files):
instance_name = "%s-%s" % (self.name, uuid.uuid4().hex[-8:])
node = self.driver.create_node(instance_name, self._find_image(image_id),
self._find_size(size_id), metadata=metadata,
files=files)
return node
def _find_image(self, image_id):
return NodeImage(id=image_id, name=None, driver=None)
def _find_size(self, size_id):
matches = [i for i in self.driver.list_sizes() if i.id == str(size_id)]
if len(matches) != 1:
return None
return matches[0]
def wait_for_instances(self, instance_ids, timeout=600):
start_time = time.time()
while True:
if (time.time() - start_time >= timeout):
raise TimeoutException()
try:
if self._all_started(instance_ids):
break
except Exception:
pass
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
def _all_started(self, node_ids):
all_nodes = RackspaceCluster._list_nodes(self.driver)
node_id_to_node = {}
for node in all_nodes:
node_id_to_node[node.id] = node
for node_id in node_ids:
try:
if node_id_to_node[node_id].state != STATE_MAP["running"]:
return False
except KeyError:
return False
return True
def terminate(self):
nodes = self._get_nodes("running")
print nodes
for node in nodes:
self.driver.destroy_node(node)
class RackspaceHadoopService(HadoopService):
def _update_cluster_membership(self, public_key, private_key):
"""
Creates a cluster-wide hosts file and copies it across the cluster.
This is a stopgap until DNS is configured on the cluster.
"""
ssh_options = '-o StrictHostKeyChecking=no'
time.sleep(30) # wait for SSH daemon to start
nodes = self.cluster._get_nodes('running')
# create hosts file
hosts_file = 'hosts'
with open(hosts_file, 'w') as f:
f.write("127.0.0.1 localhost localhost.localdomain\n")
for node in nodes:
f.write(node.public_ip[0] + "\t" + node.name + "\n")
# copy to each node in the cluster
for node in nodes:
self._call('scp -i %s %s %s root@%s:/etc/hosts' \
% (private_key, ssh_options, hosts_file, node.public_ip[0]))
os.remove(hosts_file)
def _call(self, command):
print command
try:
subprocess.call(command, shell=True)
except Exception, e:
print e
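A minimal usage sketch for this provider (assuming the RACKSPACE_KEY and RACKSPACE_SECRET environment variables are set; the cluster name and config directory below are placeholders):

    from hadoop.cloud.providers.rackspace import RackspaceCluster

    cluster = RackspaceCluster("my-cluster", "/path/to/.hadoop-cloud")
    cluster.print_status(state_filter="running")
    for instance in cluster.get_instances_in_role("nn", "running"):
        print instance.public_ip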


@ -1,640 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Classes for running services on a cluster.
"""
from __future__ import with_statement
from hadoop.cloud.cluster import get_cluster
from hadoop.cloud.cluster import InstanceUserData
from hadoop.cloud.cluster import TimeoutException
from hadoop.cloud.providers.ec2 import Ec2Storage
from hadoop.cloud.util import build_env_string
from hadoop.cloud.util import url_get
from hadoop.cloud.util import xstr
import logging
import os
import re
import socket
import subprocess
import sys
import time
logger = logging.getLogger(__name__)
MASTER = "master" # Deprecated.
NAMENODE = "nn"
SECONDARY_NAMENODE = "snn"
JOBTRACKER = "jt"
DATANODE = "dn"
TASKTRACKER = "tt"
class InstanceTemplate(object):
"""
A template for creating server instances in a cluster.
"""
def __init__(self, roles, number, image_id, size_id,
key_name, public_key, private_key,
user_data_file_template=None, placement=None,
user_packages=None, auto_shutdown=None, env_strings=[],
security_groups=[]):
self.roles = roles
self.number = number
self.image_id = image_id
self.size_id = size_id
self.key_name = key_name
self.public_key = public_key
self.private_key = private_key
self.user_data_file_template = user_data_file_template
self.placement = placement
self.user_packages = user_packages
self.auto_shutdown = auto_shutdown
self.env_strings = env_strings
self.security_groups = security_groups
def add_env_strings(self, env_strings):
new_env_strings = list(self.env_strings or [])
new_env_strings.extend(env_strings)
self.env_strings = new_env_strings
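For illustration, a template for ten worker instances might be built as follows (every value below is a placeholder, not a recommendation):

    workers = InstanceTemplate(("dn", "tt"), 10, "ami-00000000", "c1.medium",
                               "my-keypair", "/path/to/id_rsa.pub",
                               "/path/to/id_rsa", auto_shutdown="60")
    workers.add_env_strings(["EXTRA_OPT=example"])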
class Service(object):
"""
A general service that runs on a cluster.
"""
def __init__(self, cluster):
self.cluster = cluster
def get_service_code(self):
"""
The code that uniquely identifies the service.
"""
raise Exception("Unimplemented")
def list_all(self, provider):
"""
Find and print all clusters running this type of service.
"""
raise Exception("Unimplemented")
def list(self):
"""
Find and print all the instances running in this cluster.
"""
raise Exception("Unimplemented")
def launch_master(self, instance_template, config_dir, client_cidr):
"""
Launch a "master" instance.
"""
raise Exception("Unimplemented")
def launch_slaves(self, instance_template):
"""
Launch "slave" instance.
"""
raise Exception("Unimplemented")
def launch_cluster(self, instance_templates, config_dir, client_cidr):
"""
Launch a cluster of instances.
"""
raise Exception("Unimplemented")
def terminate_cluster(self, force=False):
self.cluster.print_status()
if not force and not self._prompt("Terminate all instances?"):
print "Not terminating cluster."
else:
print "Terminating cluster"
self.cluster.terminate()
def delete_cluster(self):
self.cluster.delete()
def create_formatted_snapshot(self, size, availability_zone,
image_id, key_name, ssh_options):
Ec2Storage.create_formatted_snapshot(self.cluster, size,
availability_zone,
image_id,
key_name,
ssh_options)
def list_storage(self):
storage = self.cluster.get_storage()
storage.print_status()
def create_storage(self, role, number_of_instances,
availability_zone, spec_file):
storage = self.cluster.get_storage()
storage.create(role, number_of_instances, availability_zone, spec_file)
storage.print_status()
def attach_storage(self, role):
storage = self.cluster.get_storage()
storage.attach(role, self.cluster.get_instances_in_role(role, 'running'))
storage.print_status()
def delete_storage(self, force=False):
storage = self.cluster.get_storage()
storage.print_status()
if not force and not self._prompt("Delete all storage volumes? THIS WILL \
PERMANENTLY DELETE ALL DATA"):
print "Not deleting storage volumes."
else:
print "Deleting storage"
for role in storage.get_roles():
storage.delete((role,))
def login(self, ssh_options):
raise Exception("Unimplemented")
def proxy(self, ssh_options):
raise Exception("Unimplemented")
def push(self, ssh_options, file):
raise Exception("Unimplemented")
def execute(self, ssh_options, args):
raise Exception("Unimplemented")
def update_slaves_file(self, config_dir, ssh_options, private_key):
raise Exception("Unimplemented")
def _prompt(self, prompt):
""" Returns true if user responds "yes" to prompt. """
return raw_input("%s [yes or no]: " % prompt).lower() == "yes"
def _call(self, command):
print command
try:
subprocess.call(command, shell=True)
except Exception, e:
print e
def _get_default_user_data_file_template(self):
data_path = os.path.join(os.path.dirname(__file__), 'data')
return os.path.join(data_path, '%s-%s-init-remote.sh' %
(self.get_service_code(), self.cluster.get_provider_code()))
def _launch_instances(self, instance_template):
it = instance_template
user_data_file_template = it.user_data_file_template
if it.user_data_file_template == None:
user_data_file_template = self._get_default_user_data_file_template()
ebs_mappings = ''
storage = self.cluster.get_storage()
for role in it.roles:
if storage.has_any_storage((role,)):
ebs_mappings = storage.get_mappings_string_for_role(role)
replacements = { "%ENV%": build_env_string(it.env_strings, {
"ROLES": ",".join(it.roles),
"USER_PACKAGES": it.user_packages,
"AUTO_SHUTDOWN": it.auto_shutdown,
"EBS_MAPPINGS": ebs_mappings,
}) }
instance_user_data = InstanceUserData(user_data_file_template, replacements)
instance_ids = self.cluster.launch_instances(it.roles, it.number, it.image_id,
it.size_id,
instance_user_data,
key_name=it.key_name,
public_key=it.public_key,
placement=it.placement)
print "Waiting for %s instances in role %s to start" % \
(it.number, ",".join(it.roles))
try:
self.cluster.wait_for_instances(instance_ids)
print "%s instances started" % ",".join(it.roles)
except TimeoutException:
print "Timeout while waiting for %s instance to start." % ",".join(it.roles)
return
print
self.cluster.print_status(it.roles[0])
return self.cluster.get_instances_in_role(it.roles[0], "running")
class HadoopService(Service):
"""
A HDFS and MapReduce service.
"""
def __init__(self, cluster):
super(HadoopService, self).__init__(cluster)
def get_service_code(self):
return "hadoop"
def list_all(self, provider):
"""
Find and print clusters that have a running namenode instance.
"""
legacy_clusters = get_cluster(provider).get_clusters_with_role(MASTER)
clusters = list(get_cluster(provider).get_clusters_with_role(NAMENODE))
clusters.extend(legacy_clusters)
if not clusters:
print "No running clusters"
else:
for cluster in clusters:
print cluster
def list(self):
self.cluster.print_status()
def launch_master(self, instance_template, config_dir, client_cidr):
if self.cluster.check_running(NAMENODE, 0) == False:
return # don't proceed if another master is running
self.launch_cluster((instance_template,), config_dir, client_cidr)
def launch_slaves(self, instance_template):
instances = self.cluster.check_running(NAMENODE, 1)
if not instances:
return
master = instances[0]
for role in (NAMENODE, SECONDARY_NAMENODE, JOBTRACKER):
singleton_host_env = "%s_HOST=%s" % \
(self._sanitize_role_name(role), master.public_ip)
instance_template.add_env_strings((singleton_host_env,))
self._launch_instances(instance_template)
self._attach_storage(instance_template.roles)
self._print_master_url()
def launch_cluster(self, instance_templates, config_dir, client_cidr):
number_of_tasktrackers = 0
roles = []
for it in instance_templates:
roles.extend(it.roles)
if TASKTRACKER in it.roles:
number_of_tasktrackers += it.number
self._launch_cluster_instances(instance_templates)
self._create_client_hadoop_site_file(config_dir)
self._authorize_client_ports(client_cidr)
self._attach_storage(roles)
self._update_cluster_membership(instance_templates[0].public_key,
instance_templates[0].private_key)
try:
self._wait_for_hadoop(number_of_tasktrackers)
except TimeoutException:
print "Timeout while waiting for Hadoop to start. Please check logs on" +\
" cluster."
self._print_master_url()
def login(self, ssh_options):
master = self._get_master()
if not master:
sys.exit(1)
subprocess.call('ssh %s root@%s' % \
(xstr(ssh_options), master.public_ip),
shell=True)
def proxy(self, ssh_options):
master = self._get_master()
if not master:
sys.exit(1)
options = '-o "ConnectTimeout 10" -o "ServerAliveInterval 60" ' \
'-N -D 6666'
process = subprocess.Popen('ssh %s %s root@%s' %
(xstr(ssh_options), options, master.public_ip),
stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
shell=True)
print """export HADOOP_CLOUD_PROXY_PID=%s;
echo Proxy pid %s;""" % (process.pid, process.pid)
def push(self, ssh_options, file):
master = self._get_master()
if not master:
sys.exit(1)
subprocess.call('scp %s -r %s root@%s:' % (xstr(ssh_options),
file, master.public_ip),
shell=True)
def execute(self, ssh_options, args):
master = self._get_master()
if not master:
sys.exit(1)
subprocess.call("ssh %s root@%s '%s'" % (xstr(ssh_options),
master.public_ip,
" ".join(args)), shell=True)
def update_slaves_file(self, config_dir, ssh_options, private_key):
instances = self.cluster.check_running(NAMENODE, 1)
if not instances:
sys.exit(1)
master = instances[0]
slaves = self.cluster.get_instances_in_role(DATANODE, "running")
cluster_dir = os.path.join(config_dir, self.cluster.name)
slaves_file = os.path.join(cluster_dir, 'slaves')
with open(slaves_file, 'w') as f:
for slave in slaves:
f.write(slave.public_ip + "\n")
subprocess.call('scp %s -r %s root@%s:/etc/hadoop/conf' % \
(ssh_options, slaves_file, master.public_ip), shell=True)
# Copy private key
subprocess.call('scp %s -r %s root@%s:/root/.ssh/id_rsa' % \
(ssh_options, private_key, master.public_ip), shell=True)
for slave in slaves:
subprocess.call('scp %s -r %s root@%s:/root/.ssh/id_rsa' % \
(ssh_options, private_key, slave.public_ip), shell=True)
def _get_master(self):
# For split namenode/jobtracker, designate the namenode as the master
return self._get_namenode()
def _get_namenode(self):
instances = self.cluster.get_instances_in_role(NAMENODE, "running")
if not instances:
return None
return instances[0]
def _get_jobtracker(self):
instances = self.cluster.get_instances_in_role(JOBTRACKER, "running")
if not instances:
return None
return instances[0]
def _launch_cluster_instances(self, instance_templates):
singleton_hosts = []
for instance_template in instance_templates:
instance_template.add_env_strings(singleton_hosts)
instances = self._launch_instances(instance_template)
if instance_template.number == 1:
if len(instances) != 1:
logger.error("Expected a single '%s' instance, but found %s.",
"".join(instance_template.roles), len(instances))
return
else:
for role in instance_template.roles:
singleton_host_env = "%s_HOST=%s" % \
(self._sanitize_role_name(role),
instances[0].public_ip)
singleton_hosts.append(singleton_host_env)
def _sanitize_role_name(self, role):
"""Replace characters in role name with ones allowed in bash variable names"""
return role.replace('+', '_').upper()
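# Example: the "nn" role produces a singleton host variable named NN_HOST,
# and a hypothetical "dn+tt" role would produce DN_TT_HOST.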
def _authorize_client_ports(self, client_cidrs=[]):
if not client_cidrs:
logger.debug("No client CIDRs specified, using local address.")
client_ip = url_get('http://checkip.amazonaws.com/').strip()
client_cidrs = ("%s/32" % client_ip,)
logger.debug("Client CIDRs: %s", client_cidrs)
namenode = self._get_namenode()
jobtracker = self._get_jobtracker()
for client_cidr in client_cidrs:
# Allow access to port 80 on namenode from client
self.cluster.authorize_role(NAMENODE, 80, 80, client_cidr)
# Allow access to jobtracker UI on master from client
# (so we can see when the cluster is ready)
self.cluster.authorize_role(JOBTRACKER, 50030, 50030, client_cidr)
# Allow access to namenode and jobtracker via public address from each other
namenode_ip = socket.gethostbyname(namenode.public_ip)
jobtracker_ip = socket.gethostbyname(jobtracker.public_ip)
self.cluster.authorize_role(NAMENODE, 8020, 8020, "%s/32" % namenode_ip)
self.cluster.authorize_role(NAMENODE, 8020, 8020, "%s/32" % jobtracker_ip)
self.cluster.authorize_role(JOBTRACKER, 8021, 8021, "%s/32" % namenode_ip)
self.cluster.authorize_role(JOBTRACKER, 8021, 8021,
"%s/32" % jobtracker_ip)
def _create_client_hadoop_site_file(self, config_dir):
namenode = self._get_namenode()
jobtracker = self._get_jobtracker()
cluster_dir = os.path.join(config_dir, self.cluster.name)
aws_access_key_id = os.environ.get('AWS_ACCESS_KEY_ID') or ''
aws_secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY') or ''
if not os.path.exists(cluster_dir):
os.makedirs(cluster_dir)
with open(os.path.join(cluster_dir, 'hadoop-site.xml'), 'w') as f:
f.write("""<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.job.ugi</name>
<value>root,root</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://%(namenode)s:8020/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>%(jobtracker)s:8021</value>
</property>
<property>
<name>hadoop.socks.server</name>
<value>localhost:6666</value>
</property>
<property>
<name>hadoop.rpc.socket.factory.class.default</name>
<value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>%(aws_access_key_id)s</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>%(aws_secret_access_key)s</value>
</property>
<property>
<name>fs.s3n.awsAccessKeyId</name>
<value>%(aws_access_key_id)s</value>
</property>
<property>
<name>fs.s3n.awsSecretAccessKey</name>
<value>%(aws_secret_access_key)s</value>
</property>
</configuration>
""" % {'namenode': namenode.public_ip,
'jobtracker': jobtracker.public_ip,
'aws_access_key_id': aws_access_key_id,
'aws_secret_access_key': aws_secret_access_key})
def _wait_for_hadoop(self, number, timeout=600):
start_time = time.time()
jobtracker = self._get_jobtracker()
if not jobtracker:
return
print "Waiting for jobtracker to start"
previous_running = 0
while True:
if (time.time() - start_time >= timeout):
raise TimeoutException()
try:
actual_running = self._number_of_tasktrackers(jobtracker.public_ip, 1)
break
except IOError:
pass
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
print
if number > 0:
print "Waiting for %d tasktrackers to start" % number
while actual_running < number:
if (time.time() - start_time >= timeout):
raise TimeoutException()
try:
actual_running = self._number_of_tasktrackers(jobtracker.public_ip, 5, 2)
if actual_running != previous_running:
sys.stdout.write("%d" % actual_running)
sys.stdout.write(".")
sys.stdout.flush()
time.sleep(1)
previous_running = actual_running
except IOError:
pass
print
# The optional ?type=active is a difference between Hadoop 0.18 and 0.20
_NUMBER_OF_TASK_TRACKERS = re.compile(
r'<a href="machines.jsp(?:\?type=active)?">(\d+)</a>')
def _number_of_tasktrackers(self, jt_hostname, timeout, retries=0):
jt_page = url_get("http://%s:50030/jobtracker.jsp" % jt_hostname, timeout,
retries)
m = self._NUMBER_OF_TASK_TRACKERS.search(jt_page)
if m:
return int(m.group(1))
return 0
def _print_master_url(self):
webserver = self._get_jobtracker()
if not webserver:
return
print "Browse the cluster at http://%s/" % webserver.public_ip
def _attach_storage(self, roles):
storage = self.cluster.get_storage()
if storage.has_any_storage(roles):
print "Waiting 10 seconds before attaching storage"
time.sleep(10)
for role in roles:
storage.attach(role, self.cluster.get_instances_in_role(role, 'running'))
storage.print_status(roles)
def _update_cluster_membership(self, public_key, private_key):
pass
class ZooKeeperService(Service):
"""
A ZooKeeper service.
"""
ZOOKEEPER_ROLE = "zk"
def __init__(self, cluster):
super(ZooKeeperService, self).__init__(cluster)
def get_service_code(self):
return "zookeeper"
def launch_cluster(self, instance_templates, config_dir, client_cidr):
self._launch_cluster_instances(instance_templates)
self._authorize_client_ports(client_cidr)
self._update_cluster_membership(instance_templates[0].public_key)
def _launch_cluster_instances(self, instance_templates):
for instance_template in instance_templates:
self._launch_instances(instance_template)
def _authorize_client_ports(self, client_cidrs=[]):
if not client_cidrs:
logger.debug("No client CIDRs specified, using local address.")
client_ip = url_get('http://checkip.amazonaws.com/').strip()
client_cidrs = ("%s/32" % client_ip,)
logger.debug("Client CIDRs: %s", client_cidrs)
for client_cidr in client_cidrs:
self.cluster.authorize_role(self.ZOOKEEPER_ROLE, 2181, 2181, client_cidr)
def _update_cluster_membership(self, public_key):
time.sleep(30) # wait for SSH daemon to start
ssh_options = '-o StrictHostKeyChecking=no'
private_key = public_key[:-4] # TODO: pass in private key explicitly
instances = self.cluster.get_instances_in_role(self.ZOOKEEPER_ROLE,
'running')
config_file = 'zoo.cfg'
with open(config_file, 'w') as f:
f.write("""# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# The directory where the snapshot is stored.
dataDir=/var/log/zookeeper/txlog
# The port at which the clients will connect
clientPort=2181
# The servers in the ensemble
""")
counter = 1
for i in instances:
f.write("server.%s=%s:2888:3888\n" % (counter, i.private_ip))
counter += 1
# copy to each node in the cluster
myid_file = 'myid'
counter = 1
for i in instances:
self._call('scp -i %s %s %s root@%s:/etc/zookeeper/conf/zoo.cfg' \
% (private_key, ssh_options, config_file, i.public_ip))
with open(myid_file, 'w') as f:
f.write(str(counter) + "\n")
self._call('scp -i %s %s %s root@%s:/var/log/zookeeper/txlog/myid' \
% (private_key, ssh_options, myid_file, i.public_ip))
counter += 1
os.remove(config_file)
os.remove(myid_file)
# start the zookeeper servers
for i in instances:
self._call('ssh -i %s %s root@%s nohup /etc/rc.local &' \
% (private_key, ssh_options, i.public_ip))
hosts_string = ",".join(["%s:2181" % i.public_ip for i in instances])
print "ZooKeeper cluster: %s" % hosts_string
SERVICE_PROVIDER_MAP = {
"hadoop": {
"rackspace": ('hadoop.cloud.providers.rackspace', 'RackspaceHadoopService')
},
"zookeeper": {
# "provider_code": ('hadoop.cloud.providers.provider_code', 'ProviderZooKeeperService')
},
}
DEFAULT_SERVICE_PROVIDER_MAP = {
"hadoop": HadoopService,
"zookeeper": ZooKeeperService
}
def get_service(service, provider):
"""
Retrieve the Service class for a service and provider.
"""
try:
mod_name, service_classname = SERVICE_PROVIDER_MAP[service][provider]
_mod = __import__(mod_name, globals(), locals(), [service_classname])
return getattr(_mod, service_classname)
except KeyError:
return DEFAULT_SERVICE_PROVIDER_MAP[service]
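A sketch of how a caller wires these pieces together (the provider, cluster name and configuration directory are placeholders; get_cluster is assumed to return the provider's cluster class, as its use in HadoopService.list_all suggests):

    from hadoop.cloud.cluster import get_cluster
    from hadoop.cloud.service import get_service

    cluster = get_cluster("ec2")("my-cluster", "/path/to/.hadoop-cloud")
    service = get_service("hadoop", "ec2")(cluster)
    service.list()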


@ -1,173 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Classes for controlling external cluster storage.
"""
import logging
import simplejson as json
logger = logging.getLogger(__name__)
class VolumeSpec(object):
"""
The specification for a storage volume, encapsulating all the information
needed to create a volume and ultimately mount it on an instance.
"""
def __init__(self, size, mount_point, device, snapshot_id):
self.size = size
self.mount_point = mount_point
self.device = device
self.snapshot_id = snapshot_id
class JsonVolumeSpecManager(object):
"""
A container for VolumeSpecs. This object can read VolumeSpecs specified in
JSON.
"""
def __init__(self, spec_file):
self.spec = json.load(spec_file)
def volume_specs_for_role(self, role):
return [VolumeSpec(d["size_gb"], d["mount_point"], d["device"],
d["snapshot_id"]) for d in self.spec[role]]
def get_mappings_string_for_role(self, role):
"""
Returns a short string of the form
"mount_point1,device1;mount_point2,device2;..."
which is useful for passing as an environment variable.
"""
return ";".join(["%s,%s" % (d["mount_point"], d["device"])
for d in self.spec[role]])
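# For example (illustrative values), a role with volumes mounted at "/ebs1" on
# /dev/sdj and "/ebs2" on /dev/sdk yields "/ebs1,/dev/sdj;/ebs2,/dev/sdk".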
class MountableVolume(object):
"""
A storage volume that has been created. It may or may not have been attached
or mounted to an instance.
"""
def __init__(self, volume_id, mount_point, device):
self.volume_id = volume_id
self.mount_point = mount_point
self.device = device
class JsonVolumeManager(object):
def __init__(self, filename):
self.filename = filename
def _load(self):
try:
return json.load(open(self.filename, "r"))
except IOError:
logger.debug("File %s does not exist.", self.filename)
return {}
def _store(self, obj):
return json.dump(obj, open(self.filename, "w"), sort_keys=True, indent=2)
def get_roles(self):
json_dict = self._load()
return json_dict.keys()
def add_instance_storage_for_role(self, role, mountable_volumes):
json_dict = self._load()
mv_dicts = [mv.__dict__ for mv in mountable_volumes]
json_dict.setdefault(role, []).append(mv_dicts)
self._store(json_dict)
def remove_instance_storage_for_role(self, role):
json_dict = self._load()
del json_dict[role]
self._store(json_dict)
def get_instance_storage_for_role(self, role):
"""
Returns a list of lists of MountableVolume objects. Each nested list is
the storage for one instance.
"""
try:
json_dict = self._load()
instance_storage = []
for instance in json_dict[role]:
vols = []
for vol in instance:
vols.append(MountableVolume(vol["volume_id"], vol["mount_point"],
vol["device"]))
instance_storage.append(vols)
return instance_storage
except KeyError:
return []
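# The file written by _store is a JSON object keyed by role; each role maps to a
# list of per-instance volume lists, e.g. (illustrative):
#   {"dn": [[{"volume_id": "vol-00000000", "mount_point": "/ebs1",
#             "device": "/dev/sdj"}]]}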
class Storage(object):
"""
Storage volumes for a cluster. The storage is associated with a named
cluster. Many clusters just have local storage, in which case this is
not used.
"""
def __init__(self, cluster):
self.cluster = cluster
def create(self, role, number_of_instances, availability_zone, spec_filename):
"""
Create new storage volumes for instances with the given role, according to
the mapping defined in the spec file.
"""
pass
def get_mappings_string_for_role(self, role):
"""
Returns a short string of the form
"mount_point1,device1;mount_point2,device2;..."
which is useful for passing as an environment variable.
"""
raise Exception("Unimplemented")
def has_any_storage(self, roles):
"""
Return True if any of the given roles has associated storage.
"""
return False
def get_roles(self):
"""
Return a list of roles that have storage defined.
"""
return []
def print_status(self, roles=None):
"""
Print the status of storage volumes for the given roles.
"""
pass
def attach(self, role, instances):
"""
Attach volumes for a role to instances. Some volumes may already be
attached, in which case they are ignored, and we take care not to attach
multiple volumes to an instance.
"""
pass
def delete(self, roles=[]):
"""
Permanently delete all the storage for the given roles.
"""
pass
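A volume spec file, as read by JsonVolumeSpecManager above, is a JSON document keyed by role, with one list entry per volume. An illustrative example (sizes, devices and snapshot IDs are placeholders):

    {
      "nn": [
        {"size_gb": "8", "mount_point": "/ebs1", "device": "/dev/sdj",
         "snapshot_id": "snap-00000000"}
      ],
      "dn": [
        {"size_gb": "100", "mount_point": "/ebs1", "device": "/dev/sdj",
         "snapshot_id": "snap-00000000"},
        {"size_gb": "100", "mount_point": "/ebs2", "device": "/dev/sdk",
         "snapshot_id": "snap-00000000"}
      ]
    }

Passing such a file to Storage.create (or Service.create_storage) requests that set of volumes for each instance in the role.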


@ -1,84 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utility functions.
"""
import ConfigParser
import socket
import urllib2
def bash_quote(text):
"""Quotes a string for bash, by using single quotes."""
if text == None:
return ""
return "'%s'" % text.replace("'", "'\\''")
def bash_quote_env(env):
"""Quotes the value in an environment variable assignment."""
if env.find("=") == -1:
return env
(var, value) = env.split("=", 1)
return "%s=%s" % (var, bash_quote(value))
def build_env_string(env_strings=[], pairs={}):
"""Build a bash environment variable assignment"""
env = ''
if env_strings:
for env_string in env_strings:
env += "%s " % bash_quote_env(env_string)
if pairs:
for key, val in pairs.items():
env += "%s=%s " % (key, bash_quote(val))
return env[:-1]
def merge_config_with_options(section_name, config, options):
"""
Merge configuration options with a dictionary of options.
Keys in the options dictionary take precedence.
"""
res = {}
try:
for (key, value) in config.items(section_name):
if value.find("\n") != -1:
res[key] = value.split("\n")
else:
res[key] = value
except ConfigParser.NoSectionError:
pass
for key in options:
if options[key] != None:
res[key] = options[key]
return res
def url_get(url, timeout=10, retries=0):
"""
Retrieve content from the given URL.
"""
# in Python 2.6 we can pass timeout to urllib2.urlopen
socket.setdefaulttimeout(timeout)
attempts = 0
while True:
try:
return urllib2.urlopen(url).read()
except urllib2.URLError:
attempts = attempts + 1
if attempts > retries:
raise
def xstr(string):
"""Sane string conversion: return an empty string if string is None."""
return '' if string is None else str(string)
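build_env_string produces the value substituted for %ENV% in the instance user-data templates (see Service._launch_instances). A quick illustration of its output:

    >>> build_env_string(env_strings=["FOO=a b"], pairs={"ROLES": "dn,tt"})
    "FOO='a b' ROLES='dn,tt'"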


@ -1,30 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from distutils.core import setup
version = __import__('hadoop.cloud').cloud.VERSION
setup(name='hadoop-cloud',
version=version,
description='Scripts for running Hadoop on cloud providers',
license = 'Apache License (2.0)',
url = 'http://hadoop.apache.org/common/',
packages=['hadoop', 'hadoop.cloud','hadoop.cloud.providers'],
package_data={'hadoop.cloud': ['data/*.sh']},
scripts=['hadoop-ec2'],
author = 'Apache Hadoop Contributors',
author_email = 'common-dev@hadoop.apache.org',
)


@ -1,37 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
from hadoop.cloud.cluster import RoleSyntaxException
from hadoop.cloud.providers.ec2 import Ec2Cluster
class TestCluster(unittest.TestCase):
def test_group_name_for_role(self):
cluster = Ec2Cluster("test-cluster", None)
self.assertEqual("test-cluster-foo", cluster._group_name_for_role("foo"))
def test_check_role_name_valid(self):
cluster = Ec2Cluster("test-cluster", None)
cluster._check_role_name(
"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_+")
def test_check_role_name_dash_is_invalid(self):
cluster = Ec2Cluster("test-cluster", None)
self.assertRaises(RoleSyntaxException, cluster._check_role_name, "a-b")
if __name__ == '__main__':
unittest.main()


@ -1,74 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import StringIO
import unittest
from hadoop.cloud.providers.rackspace import RackspaceCluster
class TestCluster(unittest.TestCase):
class DriverStub(object):
def list_nodes(self):
class NodeStub(object):
def __init__(self, name, metadata):
self.id = name
self.name = name
self.state = 'ACTIVE'
self.public_ip = ['100.0.0.1']
self.private_ip = ['10.0.0.1']
self.extra = { 'metadata': metadata }
return [NodeStub('random_instance', {}),
NodeStub('cluster1-nj-000', {'cluster': 'cluster1', 'roles': 'nn,jt'}),
NodeStub('cluster1-dt-000', {'cluster': 'cluster1', 'roles': 'dn,tt'}),
NodeStub('cluster1-dt-001', {'cluster': 'cluster1', 'roles': 'dn,tt'}),
NodeStub('cluster2-dt-000', {'cluster': 'cluster2', 'roles': 'dn,tt'}),
NodeStub('cluster3-nj-000', {'cluster': 'cluster3', 'roles': 'nn,jt'})]
def test_get_clusters_with_role(self):
self.assertEqual(set(['cluster1', 'cluster2']),
RackspaceCluster.get_clusters_with_role('dn', 'running',
TestCluster.DriverStub()))
def test_get_instances_in_role(self):
cluster = RackspaceCluster('cluster1', None, TestCluster.DriverStub())
instances = cluster.get_instances_in_role('nn')
self.assertEquals(1, len(instances))
self.assertEquals('cluster1-nj-000', instances[0].id)
instances = cluster.get_instances_in_role('tt')
self.assertEquals(2, len(instances))
self.assertEquals(set(['cluster1-dt-000', 'cluster1-dt-001']),
set([i.id for i in instances]))
def test_print_status(self):
cluster = RackspaceCluster('cluster1', None, TestCluster.DriverStub())
out = StringIO.StringIO()
cluster.print_status(None, "running", out)
self.assertEquals("""nn,jt cluster1-nj-000 cluster1-nj-000 100.0.0.1 10.0.0.1 running
dn,tt cluster1-dt-000 cluster1-dt-000 100.0.0.1 10.0.0.1 running
dn,tt cluster1-dt-001 cluster1-dt-001 100.0.0.1 10.0.0.1 running
""", out.getvalue().replace("\t", " "))
out = StringIO.StringIO()
cluster.print_status(["dn"], "running", out)
self.assertEquals("""dn,tt cluster1-dt-000 cluster1-dt-000 100.0.0.1 10.0.0.1 running
dn,tt cluster1-dt-001 cluster1-dt-001 100.0.0.1 10.0.0.1 running
""", out.getvalue().replace("\t", " "))
if __name__ == '__main__':
unittest.main()


@ -1,143 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import unittest
import simplejson as json
from StringIO import StringIO
from hadoop.cloud.storage import MountableVolume
from hadoop.cloud.storage import JsonVolumeManager
from hadoop.cloud.storage import JsonVolumeSpecManager
spec = {
"master": ({"size_gb":"8", "mount_point":"/", "device":"/dev/sdj",
"snapshot_id": "snap_1"},
),
"slave": ({"size_gb":"8", "mount_point":"/", "device":"/dev/sdj",
"snapshot_id": "snap_2"},
{"size_gb":"10", "mount_point":"/data1", "device":"/dev/sdk",
"snapshot_id": "snap_3"},
)
}
class TestJsonVolumeSpecManager(unittest.TestCase):
def test_volume_specs_for_role(self):
input = StringIO(json.dumps(spec))
volume_spec_manager = JsonVolumeSpecManager(input)
master_specs = volume_spec_manager.volume_specs_for_role("master")
self.assertEqual(1, len(master_specs))
self.assertEqual("/", master_specs[0].mount_point)
self.assertEqual("8", master_specs[0].size)
self.assertEqual("/dev/sdj", master_specs[0].device)
self.assertEqual("snap_1", master_specs[0].snapshot_id)
slave_specs = volume_spec_manager.volume_specs_for_role("slave")
self.assertEqual(2, len(slave_specs))
self.assertEqual("snap_2", slave_specs[0].snapshot_id)
self.assertEqual("snap_3", slave_specs[1].snapshot_id)
self.assertRaises(KeyError, volume_spec_manager.volume_specs_for_role,
"no-such-role")
def test_get_mappings_string_for_role(self):
input = StringIO(json.dumps(spec))
volume_spec_manager = JsonVolumeSpecManager(input)
master_mappings = volume_spec_manager.get_mappings_string_for_role("master")
self.assertEqual("/,/dev/sdj", master_mappings)
slave_mappings = volume_spec_manager.get_mappings_string_for_role("slave")
self.assertEqual("/,/dev/sdj;/data1,/dev/sdk", slave_mappings)
self.assertRaises(KeyError,
volume_spec_manager.get_mappings_string_for_role,
"no-such-role")
class TestJsonVolumeManager(unittest.TestCase):
def tearDown(self):
try:
os.remove("volumemanagertest.json")
except OSError:
pass
def test_add_instance_storage_for_role(self):
volume_manager = JsonVolumeManager("volumemanagertest.json")
self.assertEqual(0,
len(volume_manager.get_instance_storage_for_role("master")))
self.assertEqual(0, len(volume_manager.get_roles()))
volume_manager.add_instance_storage_for_role("master",
[MountableVolume("vol_1", "/",
"/dev/sdj")])
master_storage = volume_manager.get_instance_storage_for_role("master")
self.assertEqual(1, len(master_storage))
master_storage_instance0 = master_storage[0]
self.assertEqual(1, len(master_storage_instance0))
master_storage_instance0_vol0 = master_storage_instance0[0]
self.assertEqual("vol_1", master_storage_instance0_vol0.volume_id)
self.assertEqual("/", master_storage_instance0_vol0.mount_point)
self.assertEqual("/dev/sdj", master_storage_instance0_vol0.device)
volume_manager.add_instance_storage_for_role("slave",
[MountableVolume("vol_2", "/",
"/dev/sdj")])
self.assertEqual(1,
len(volume_manager.get_instance_storage_for_role("master")))
slave_storage = volume_manager.get_instance_storage_for_role("slave")
self.assertEqual(1, len(slave_storage))
slave_storage_instance0 = slave_storage[0]
self.assertEqual(1, len(slave_storage_instance0))
slave_storage_instance0_vol0 = slave_storage_instance0[0]
self.assertEqual("vol_2", slave_storage_instance0_vol0.volume_id)
self.assertEqual("/", slave_storage_instance0_vol0.mount_point)
self.assertEqual("/dev/sdj", slave_storage_instance0_vol0.device)
volume_manager.add_instance_storage_for_role("slave",
[MountableVolume("vol_3", "/", "/dev/sdj"),
MountableVolume("vol_4", "/data1", "/dev/sdk")])
self.assertEqual(1,
len(volume_manager.get_instance_storage_for_role("master")))
slave_storage = volume_manager.get_instance_storage_for_role("slave")
self.assertEqual(2, len(slave_storage))
slave_storage_instance0 = slave_storage[0]
slave_storage_instance1 = slave_storage[1]
self.assertEqual(1, len(slave_storage_instance0))
self.assertEqual(2, len(slave_storage_instance1))
slave_storage_instance1_vol0 = slave_storage_instance1[0]
slave_storage_instance1_vol1 = slave_storage_instance1[1]
self.assertEqual("vol_3", slave_storage_instance1_vol0.volume_id)
self.assertEqual("/", slave_storage_instance1_vol0.mount_point)
self.assertEqual("/dev/sdj", slave_storage_instance1_vol0.device)
self.assertEqual("vol_4", slave_storage_instance1_vol1.volume_id)
self.assertEqual("/data1", slave_storage_instance1_vol1.mount_point)
self.assertEqual("/dev/sdk", slave_storage_instance1_vol1.device)
roles = volume_manager.get_roles()
self.assertEqual(2, len(roles))
self.assertTrue("slave" in roles)
self.assertTrue("master" in roles)
if __name__ == '__main__':
unittest.main()


@ -1,44 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import tempfile
import unittest
from hadoop.cloud.cluster import InstanceUserData
class TestInstanceUserData(unittest.TestCase):
def test_replacement(self):
file = tempfile.NamedTemporaryFile()
file.write("Contents go here")
file.flush()
self.assertEqual("Contents go here",
InstanceUserData(file.name, {}).read())
self.assertEqual("Contents were here",
InstanceUserData(file.name, { "go": "were"}).read())
self.assertEqual("Contents here",
InstanceUserData(file.name, { "go": None}).read())
file.close()
def test_read_file_url(self):
file = tempfile.NamedTemporaryFile()
file.write("Contents go here")
file.flush()
self.assertEqual("Contents go here",
InstanceUserData("file://%s" % file.name, {}).read())
file.close()
if __name__ == '__main__':
unittest.main()


@ -1,81 +0,0 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ConfigParser
import StringIO
import unittest
from hadoop.cloud.util import bash_quote
from hadoop.cloud.util import bash_quote_env
from hadoop.cloud.util import build_env_string
from hadoop.cloud.util import merge_config_with_options
from hadoop.cloud.util import xstr
class TestUtilFunctions(unittest.TestCase):
def test_bash_quote(self):
self.assertEqual("", bash_quote(None))
self.assertEqual("''", bash_quote(""))
self.assertEqual("'a'", bash_quote("a"))
self.assertEqual("'a b'", bash_quote("a b"))
self.assertEqual("'a\b'", bash_quote("a\b"))
self.assertEqual("'a '\\'' b'", bash_quote("a ' b"))
def test_bash_quote_env(self):
self.assertEqual("", bash_quote_env(""))
self.assertEqual("a", bash_quote_env("a"))
self.assertEqual("a='b'", bash_quote_env("a=b"))
self.assertEqual("a='b c'", bash_quote_env("a=b c"))
self.assertEqual("a='b\c'", bash_quote_env("a=b\c"))
self.assertEqual("a='b '\\'' c'", bash_quote_env("a=b ' c"))
def test_build_env_string(self):
self.assertEqual("", build_env_string())
self.assertEqual("a='b' c='d'",
build_env_string(env_strings=["a=b", "c=d"]))
self.assertEqual("a='b' c='d'",
build_env_string(pairs={"a": "b", "c": "d"}))
def test_merge_config_with_options(self):
options = { "a": "b" }
config = ConfigParser.ConfigParser()
self.assertEqual({ "a": "b" },
merge_config_with_options("section", config, options))
config.add_section("section")
self.assertEqual({ "a": "b" },
merge_config_with_options("section", config, options))
config.set("section", "a", "z")
config.set("section", "c", "d")
self.assertEqual({ "a": "z", "c": "d" },
merge_config_with_options("section", config, {}))
self.assertEqual({ "a": "b", "c": "d" },
merge_config_with_options("section", config, options))
def test_merge_config_with_options_list(self):
config = ConfigParser.ConfigParser()
config.readfp(StringIO.StringIO("""[section]
env1=a=b
c=d
env2=e=f
g=h"""))
self.assertEqual({ "env1": ["a=b", "c=d"], "env2": ["e=f", "g=h"] },
merge_config_with_options("section", config, {}))
def test_xstr(self):
self.assertEqual("", xstr(None))
self.assertEqual("a", xstr("a"))
if __name__ == '__main__':
unittest.main()


@ -1,46 +0,0 @@
#!/bin/bash -x
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Given an Ubuntu base system install, install the base packages we need.
#
# We require multiverse to be enabled.
cat >> /etc/apt/sources.list << EOF
deb http://us.archive.ubuntu.com/ubuntu/ intrepid multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ intrepid multiverse
deb http://us.archive.ubuntu.com/ubuntu/ intrepid-updates multiverse
deb-src http://us.archive.ubuntu.com/ubuntu/ intrepid-updates multiverse
EOF
apt-get update
# Install Java
apt-get -y install sun-java6-jdk
echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> /etc/profile
export JAVA_HOME=/usr/lib/jvm/java-6-sun
java -version
# Install general packages
apt-get -y install vim curl screen ssh rsync unzip openssh-server
apt-get -y install policykit # http://www.bergek.com/2008/11/24/ubuntu-810-libpolkit-error/
# Create root's .ssh directory if it doesn't exist
mkdir -p /root/.ssh
# Run any rackspace init script injected at boot time
echo '[ -f /etc/init.d/rackspace-init.sh ] && /bin/sh /etc/init.d/rackspace-init.sh; exit 0' > /etc/rc.local