Hadoop On Demand (HOD) is a system for provisioning and managing independent Hadoop MapReduce and Hadoop Distributed File System (HDFS) instances on a shared cluster of nodes. HOD is a tool that makes it easy for administrators and users to quickly set up and use Hadoop. HOD is also a very useful tool for Hadoop developers and testers who need to share a physical cluster for testing their own Hadoop versions.
HOD uses the Torque resource manager to do node allocation. On the allocated nodes, it can start Hadoop MapReduce and HDFS daemons. It automatically generates the appropriate configuration files (hadoop-site.xml) for the Hadoop daemons and client. HOD also has the capability to distribute Hadoop to the nodes in the virtual cluster that it allocates. HOD supports Hadoop from version 0.15 onwards.
This section shows users how to get started using HOD, reviews various HOD features and command line options, and provides detailed troubleshooting help.
In this section, we shall see a step-by-step introduction to using HOD for the most basic operations. Before following these steps, it is assumed that HOD and its dependent hardware and software components are set up and configured correctly. This is a step that is generally performed by the system administrators of the cluster.
The HOD user interface is a command line utility called hod. It is driven by a configuration file that is typically set up for users by system administrators. Users can override this configuration when using hod, which is described later in this documentation. The configuration file can be specified in two ways when using hod, as described below:

- Specify it on the command line, using the -c option: hod <operation> <required-args> -c path-to-the-configuration-file [other-options]
- Set up the environment variable HOD_CONF_DIR where hod will be run. This should point to a directory on the local file system containing a file called hodrc. Note that this is analogous to the HADOOP_CONF_DIR and hadoop-site.xml file for Hadoop. If no configuration file is specified on the command line, hod shall look for the HOD_CONF_DIR environment variable and a hodrc file under that directory.

In the examples listed below, we shall not explicitly point to the configuration option, assuming it is correctly specified.
A typical session of HOD will involve at least three steps: allocate, run Hadoop jobs, deallocate. In order to do this, perform the following steps.
Create a Cluster Directory

The cluster directory is a directory on the local file system where hod will generate the Hadoop configuration, hadoop-site.xml, corresponding to the cluster it allocates. Pass this directory to the hod operations as stated below. If the cluster directory passed doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option.
Operation allocate
The allocate operation is used to allocate a set of nodes and install and provision Hadoop on them. It has the following syntax. Note that it requires a cluster_dir (-d, --hod.clusterdir) and the number of nodes (-n, --hod.nodecount) to be allocated:
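A sketch of the command form described above; cluster_dir and number_of_nodes are placeholders for your own values:

```
$ hod allocate -d cluster_dir -n number_of_nodes
```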
If the command completes successfully, then cluster_dir/hadoop-site.xml will be generated and will contain information about the allocated cluster. It will also print out information about the Hadoop web UIs.
An example run of this command produces the following output. Note in this example that ~/hod-clusters/test is the cluster directory, and we are allocating 5 nodes:
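The run would look something like the following; the hostnames and ports in the output are illustrative, not actual values:

```
$ hod allocate -d ~/hod-clusters/test -n 5
INFO - HDFS UI on http://node1.example.com:53422
INFO - Mapred UI on http://node2.example.com:55380
```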
Running Hadoop jobs using the allocated cluster
Now, one can run Hadoop jobs using the allocated cluster in the usual manner. This assumes variables like JAVA_HOME and the path to the Hadoop installation are set up correctly:
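The usual form points the hadoop client at the cluster directory; hadoop_command and hadoop_command_args are placeholders:

```
$ hadoop --config cluster_dir hadoop_command hadoop_command_args
```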
or
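Equivalently, the cluster directory can be exported once as HADOOP_CONF_DIR:

```
$ export HADOOP_CONF_DIR=cluster_dir
$ hadoop hadoop_command hadoop_command_args
```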
Continuing our example, the following command will run a wordcount example on the allocated cluster:
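A sketch, assuming a Hadoop examples jar; the jar location and input/output paths are placeholders:

```
$ hadoop --config ~/hod-clusters/test jar /path/to/hadoop-examples.jar wordcount /path/to/input /path/to/output
```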
or
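The same run with the environment variable form (same placeholder paths):

```
$ export HADOOP_CONF_DIR=~/hod-clusters/test
$ hadoop jar /path/to/hadoop-examples.jar wordcount /path/to/input /path/to/output
```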
Operation deallocate
The deallocate operation is used to release an allocated cluster. When finished with a cluster, deallocate must be run so that the nodes become free for others to use. The deallocate operation has the following syntax. Note that it requires the cluster_dir (-d, --hod.clusterdir) argument:
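A sketch of the command form, with cluster_dir as the placeholder:

```
$ hod deallocate -d cluster_dir
```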
Continuing our example, the following command will deallocate the cluster:
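That is, for the cluster directory used in the example above:

```
$ hod deallocate -d ~/hod-clusters/test
```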
As can be seen, HOD allows users to allocate a cluster and use it flexibly for running Hadoop jobs. For example, users can run multiple jobs in parallel on the same cluster by running hadoop from multiple shells pointing to the same configuration.
The HOD script operation combines the operations of allocating, using and deallocating a cluster into a single operation. This is very useful for users who want to run a script of hadoop jobs and let HOD handle the cleanup automatically once the script completes. In order to run hadoop scripts using hod, do the following:
Create a script file
This will be a regular shell script that will typically contain hadoop commands, such as:
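A minimal sketch of such a script; the jar location and job arguments are hypothetical:

```
#!/bin/sh
# Runs a wordcount job against the cluster HOD allocates for this script.
# HOD sets HADOOP_CONF_DIR to the allocated cluster's configuration.
hadoop jar /path/to/hadoop-examples.jar wordcount /path/to/input /path/to/output
```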
However, the user can add any valid commands as part of the script. HOD will execute this script setting HADOOP_CONF_DIR automatically to point to the allocated cluster, so users do not need to worry about this. The users, however, need to specify a cluster directory just like when using the allocate operation.
Running the script
The syntax for the script operation is as follows. Note that it requires a cluster directory (-d, --hod.clusterdir), the number of nodes (-n, --hod.nodecount) and a script file (-s, --hod.script):
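A sketch of the command form, with placeholders as before:

```
$ hod script -d cluster_dir -n number_of_nodes -s script_file
```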
Note that HOD will deallocate the cluster as soon as the script completes; this means that the script must not exit until the hadoop jobs themselves are complete. Users must take care of this while writing the script.
The primary feature of HOD is to provision Hadoop MapReduce and HDFS clusters. This is described above in the Getting Started section. Also, as long as nodes are available, and organizational policies allow, a user can use HOD to allocate multiple MapReduce clusters simultaneously. The user would need to specify different paths for the cluster_dir parameter mentioned above for each cluster he/she allocates. HOD provides the list and the info operations to enable managing multiple clusters.
Operation list
The list operation lists all the clusters allocated so far by a user. The cluster directory where the hadoop-site.xml is stored for the cluster, and its status with respect to connectivity with the JobTracker and/or HDFS, is shown. The list operation has the following syntax:
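A sketch; the operation needs no additional required arguments:

```
$ hod list
```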
Operation info
The info operation shows information about a given cluster. The information shown includes the Torque job id, and the locations of the important daemons like the HOD Ringmaster process, and the Hadoop JobTracker and NameNode daemons. The info operation has the following syntax. Note that it requires a cluster directory (-d, --hod.clusterdir):
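A sketch of the command form:

```
$ hod info -d cluster_dir
```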
The cluster_dir should be a valid cluster directory specified in an earlier allocate operation.
When provisioning Hadoop, HOD can use either a pre-installed Hadoop on the cluster nodes or distribute and install a Hadoop tarball as part of the provisioning operation. If the tarball option is used, there is no need to have Hadoop pre-installed on the cluster nodes. This is especially useful in a development / QE environment where individual developers may have different versions of Hadoop to test on a shared cluster.
In order to use a pre-installed Hadoop, you must specify, in the hodrc, the pkgs option in the gridservice-hdfs and gridservice-mapred sections. This must point to the path where Hadoop is installed on all nodes of the cluster.
The syntax for specifying a tarball is as follows:
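A sketch of the command form; hadoop_tarball_location is a placeholder for the tarball path:

```
$ hod allocate -d cluster_dir -n number_of_nodes -t hadoop_tarball_location
```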
For example, the following command allocates Hadoop provided by the tarball ~/share/hadoop.tar.gz:
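Here the cluster directory and node count are illustrative:

```
$ hod allocate -d ~/hadoop-cluster -n 10 -t ~/share/hadoop.tar.gz
```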
Similarly, when using hod script, the syntax is as follows:
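Again a sketch with placeholder values:

```
$ hod script -d cluster_dir -n number_of_nodes -s script_file -t hadoop_tarball_location
```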
The hadoop_tarball specified in the syntax above should point to a path on a shared file system that is accessible from all the compute nodes. Currently, HOD only supports NFS mounted file systems.
Note: For better distribution performance, it is recommended that the Hadoop tarball contain only the libraries and binaries, and not the source or documentation. When you want to run jobs against a cluster allocated using a tarball, you must use a compatible version of Hadoop to submit your jobs.
In typical Hadoop clusters provisioned by HOD, HDFS is already set up statically (without using HOD). This allows data to persist in HDFS after the HOD provisioned clusters are deallocated. To use a statically configured HDFS, your hodrc must point to an external HDFS. Specifically, set the following options to the correct values in the section gridservice-hdfs of the hodrc:
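A sketch of the relevant options in the gridservice-hdfs section; the angle-bracketed values are placeholders for your NameNode's details:

```
external = true
host = <hostname of the HDFS NameNode>
fs_port = <port of the HDFS NameNode>
info_port = <port of the HDFS NameNode web UI>
```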
Note: You can also enable this option from the command line. That is, to use a static HDFS, you will need to say:
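A sketch of the command form:

```
$ hod allocate -d cluster_dir -n number_of_nodes --gridservice-hdfs.external
```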
HOD can be used to provision an HDFS cluster as well as a MapReduce cluster, if required. To do so, set the following option in the section gridservice-hdfs of the hodrc:
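The option, mirroring the external-HDFS setting discussed above:

```
external = false
```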
HOD provides a very convenient mechanism to configure both the Hadoop daemons that it provisions and also the hadoop-site.xml that it generates on the client side. This is done by specifying Hadoop configuration parameters in either the HOD configuration file, or from the command line when allocating clusters.
Configuring Hadoop Daemons
For configuring the Hadoop daemons, you can do the following:
For MapReduce, specify the options as a comma separated list of key-value pairs to the server-params option in the gridservice-mapred section. Likewise, for a dynamically provisioned HDFS cluster, specify the options in the server-params option in the gridservice-hdfs section. If these parameters should be marked as final, then include them in the final-server-params option of the appropriate section.
For example:
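A representative hodrc snippet; the parameter values are illustrative only:

```
server-params = mapred.reduce.parallel.copies=20,io.sort.factor=100,io.sort.mb=128,io.file.buffer.size=131072
final-server-params = mapred.child.java.opts=-Xmx512m
```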
In order to provide the options from the command line, you can use the following syntax. For configuring the MapReduce daemons, use:
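A sketch; the short options take their value directly appended, and the parameters shown are illustrative:

```
$ hod allocate -d cluster_dir -n number_of_nodes -Mmapred.reduce.parallel.copies=20 -Mio.sort.factor=100
```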
In the example above, the mapred.reduce.parallel.copies parameter and the io.sort.factor parameter will be appended to the other server-params, or, if they already exist in server-params, will override them. In order to specify that these are final parameters, you can use:
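The same form with -F, per the final-parameter discussion above:

```
$ hod allocate -d cluster_dir -n number_of_nodes -Fmapred.reduce.parallel.copies=20 -Fio.sort.factor=100
```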
However, note that final parameters cannot be overwritten from the command line. They can only be appended if not already specified.
Similar options exist for configuring dynamically provisioned HDFS daemons. For doing so, replace -M with -H and -F with -S.
Configuring Hadoop Job Submission (Client) Programs
As mentioned above, if the allocation operation completes successfully then cluster_dir/hadoop-site.xml will be generated and will contain information about the allocated cluster's JobTracker and NameNode. This configuration is used when submitting jobs to the cluster. HOD provides an option to include additional Hadoop configuration parameters into this file. The syntax for doing so is as follows:
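A sketch using the -C option; the parameter values are illustrative:

```
$ hod allocate -d cluster_dir -n number_of_nodes -Cmapred.userlog.limit.kb=200 -Cmapred.child.java.opts=-Xmx512m
```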
In this example, the mapred.userlog.limit.kb and mapred.child.java.opts options will be included in the hadoop-site.xml that is generated by HOD.
The HOD allocation operation prints the JobTracker and NameNode web UI URLs. For example:
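Illustrative output; the hostnames and ports are made up:

```
$ hod allocate -d ~/hadoop-cluster -n 10
INFO - HDFS UI on http://node1.example.com:55391
INFO - Mapred UI on http://node2.example.com:54874
```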
The same information is also available via the info operation described above.
To get the Hadoop logs of the daemons running on one of the allocated nodes:

- Log into the node of interest. If you want to look at the logs of the JobTracker or NameNode, you can find the node running them by using the list and info operations mentioned above.
- Get the process information of the daemon of interest (for example, ps ux | grep TaskTracker).
- In the process information, search for the value of the variable -Dhadoop.log.dir. Typically this will be a descendant directory of the hodring.temp-dir value from the hod configuration file.
- Change to the hadoop.log.dir directory to view daemon and user logs.

HOD also provides a mechanism to collect logs when a cluster is being deallocated and persist them into a file system, or an externally configured HDFS. By doing so, these logs can be viewed after the jobs are completed and the nodes are released. In order to do so, configure the log-destination-uri to a URI as follows:
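A sketch of the two supported URI forms; host, port and paths are placeholders:

```
# Upload to an externally configured HDFS:
log-destination-uri = hdfs://host:port/path
# Or persist to a locally mounted file system:
log-destination-uri = file://path
```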
Under the root directory specified above in the path, HOD will create a path user_name/torque_jobid and store gzipped log files for each node that was part of the job.
Note that to store the files to HDFS, you may need to configure the hodring.pkgs option with the Hadoop version that matches the HDFS mentioned. If not, HOD will try to use the Hadoop version that it is using to provision the Hadoop cluster itself.
HOD automatically deallocates clusters that are not running Hadoop jobs for a given period of time. Each HOD allocation includes a monitoring facility that constantly checks for running Hadoop jobs. If it detects no running Hadoop jobs for a given period, it will automatically deallocate its own cluster and thus free up nodes which are not being used effectively.
Note: When the cluster is deallocated in this way, the cluster directory is not cleaned up automatically. The user must deallocate the cluster through the regular deallocate operation to clean it up.
HOD allows the user to specify a wallclock time and a name (or title) for a Torque job.
The wallclock time is the estimated amount of time for which the Torque job will be valid. After this time has expired, Torque will automatically delete the job and free up the nodes. Specifying the wallclock time can also help the job scheduler to better schedule jobs, and help improve utilization of cluster resources.
To specify the wallclock time, use the following syntax:
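A sketch; time_in_seconds is a placeholder:

```
$ hod allocate -d cluster_dir -n number_of_nodes -l time_in_seconds
```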
The name or title of a Torque job helps in user-friendly identification of the job. The string specified here will show up in all information where Torque job attributes are displayed, including the qstat command.
To specify the name or title, use the following syntax:
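A sketch; name_of_job is a placeholder:

```
$ hod allocate -d cluster_dir -n number_of_nodes -N name_of_job
```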
Note: Due to a restriction in the underlying Torque resource manager, names that do not start with an alphabetic character or that contain a space will cause the job to fail. The failure message points to the problem being in the specified job name.
HOD exit codes are captured in the Torque exit_status field. This helps users and system administrators to distinguish successful runs from unsuccessful runs of HOD. The exit code is 0 if allocation succeeded and all hadoop jobs ran on the allocated cluster correctly. It is non-zero if allocation failed or some of the hadoop jobs failed on the allocated cluster. The possible exit codes are listed in the table below. Note: Hadoop job status is captured only if the version of Hadoop used is 0.16 or above.
| Exit Code | Meaning |
|---|---|
| 6 | Ringmaster failure |
| 7 | HDFS failure |
| 8 | Job tracker failure |
| 10 | Cluster dead |
| 12 | Cluster already allocated |
| 13 | HDFS dead |
| 14 | Mapred dead |
| 16 | All MapReduce jobs that ran on the cluster failed. Refer to Hadoop logs for more details. |
| 17 | Some of the MapReduce jobs that ran on the cluster failed. Refer to Hadoop logs for more details. |
The HOD command line has the following general syntax:
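A sketch of the general form:

```
hod <operation> [ARGS] [OPTIONS]
```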
Allowed operations are 'allocate', 'deallocate', 'info', 'list', 'script' and 'help'. For help with a particular operation do:
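A sketch:

```
hod help <operation>
```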
To have a look at possible options do:
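```
hod help options
```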
The allocate operation, for example, allocates a cluster on the given number of cluster nodes and stores the allocation information in cluster_dir for use with subsequent hadoop commands. Note that the cluster_dir must exist before running the command.

hod help gives the usage and basic options, and is equivalent to hod --help (see below). When 'options' is given as an argument, hod displays only the basic options that hod takes. When an operation is specified, it displays the usage and description corresponding to that particular operation. For example, to know about the allocate operation, one can do hod help allocate.
Besides the operations, HOD can take the following command line options:

- --verbose-help: All configuration options provided in the hodrc file can be passed on the command line, using the syntax --section_name.option_name[=value]. When provided this way, the value provided on the command line overrides the option provided in hodrc. The verbose-help command lists all the available options in the hodrc file. This is also a nice way to see the meaning of the configuration options.

See Options Configuring HOD for a description of the most important hod configuration options. For basic options do hod help options, and for all options possible in hod configuration do hod --verbose-help. See HOD Configuration for a description of all options.
As described above, HOD is configured using a configuration file that is usually set up by system administrators. This is an INI-style configuration file that is divided into sections, with options inside each section. Each section relates to one of the HOD processes: client, ringmaster, hodring, mapreduce or hdfs. The options inside a section comprise an option name and a value.
Users can override the configuration defined in the default configuration in two ways:

- Specify their own configuration file to the hod client with the -c option
- Specify individual configuration options on the hod command line

This section describes some of the most commonly used configuration options. These commonly used options are provided with a short option for convenience of specification. All other options can be specified using a long option that is also described below.
- -c config_file: Provides the configuration file to use. Alternatively, the HOD_CONF_DIR environment variable can be defined to specify a directory that contains a file named hodrc, alleviating the need to specify the configuration file in each HOD command.
- -d cluster_dir: The cluster directory is where hod will generate the Hadoop configuration, hadoop-site.xml, corresponding to the cluster it allocates. Pass it to the hod operations as an argument to -d or --hod.clusterdir. If it doesn't already exist, HOD will automatically try to create it and use it. Once a cluster is allocated, a user can utilize it to run Hadoop jobs by specifying the cluster directory as the Hadoop --config option.
- -N job_name: Specifies a name (or title) for the Torque job that HOD submits. Underneath, HOD uses the qsub -N option, and the name can be seen as the job name using the qstat command.

The following section identifies some of the most likely error conditions users can run into when using HOD, and ways to troubleshoot them.
hod Hangs During Allocation

Possible Cause: One of the HOD or Hadoop components has failed to come up. In such a case, the hod command will return after a few minutes (typically 2-3 minutes) with an error code of either 7 or 8, as defined in the Error Codes section. Refer to that section for further details.
Possible Cause: A large allocation is fired with a tarball. Sometimes due to load in the network, or on the allocated nodes, the tarball distribution might be significantly slow and take a couple of minutes to come back. Wait for completion. Also check that the tarball does not have the Hadoop sources or documentation.
Possible Cause: A Torque related problem. If the cause is Torque related, the hod command will not return for more than 5 minutes. Running hod in debug mode may show the qstat command being executed repeatedly. Executing the qstat command from a separate shell may show that the job is in the Q (Queued) state. This usually indicates a problem with Torque. Possible causes could include some nodes being down, or new nodes added that Torque is not aware of. Generally, system administrator help is needed to resolve this problem.
hod Hangs During Deallocation

Possible Cause: A Torque related problem, usually load on the Torque server, or the allocation is very large. Generally, waiting for the command to complete is the only option.
If the exit code of the hod command is not 0, then refer to the following table of error exit codes to determine why the code may have occurred and how to debug the situation.
Error Codes
| Error Code | Meaning | Possible Causes and Remedial Actions |
|---|---|---|
| 1 | Configuration error | Incorrect configuration values specified in hodrc, or other errors related to HOD configuration. The error messages in this case must be sufficient to debug and fix the problem. |
| 2 | Invalid operation | Do hod help for the list of valid operations. |
| 3 | Invalid operation arguments | Do hod help operation for listing the usage of a particular operation. |
| 4 | Scheduler failure | 1. Requested more resources than available. Run checknodes cluster_name to see if enough nodes are available. 2. Requested resources exceed resource manager limits. 3. Torque is misconfigured, the path to Torque binaries is misconfigured, or other Torque problems. Contact system administrator. |
| 5 | Job execution failure | 1. Torque job was deleted from outside. Execute the Torque qstat command to see if you have any jobs in the R (Running) state. If none exist, try re-executing HOD. 2. Torque problems such as the server momentarily going down, or becoming unresponsive. Contact system administrator. 3. The system administrator might have configured account verification, and an invalid account is specified. Contact system administrator. |
| 6 | Ringmaster failure | HOD prints the message "Cluster could not be allocated because of the following errors on the ringmaster host <hostname>". The actual error message may indicate one of the following: 1. Invalid configuration on the node running the ringmaster, specified by the hostname in the error message. 2. Invalid configuration in the ringmaster section. 3. Invalid pkgs option in the gridservice-mapred or gridservice-hdfs section. 4. An invalid hadoop tarball, or a tarball which has bundled an invalid configuration file in the conf directory. 5. Mismatched version in Hadoop between the MapReduce and an external HDFS. The Torque qstat command will most likely show a job in the C (Completed) state. One can log into the ringmaster host as given by the HOD failure message and debug the problem with the help of the error message. If the error message doesn't give complete information, the ringmaster logs should help in finding the root cause of the problem. Refer to the section Locating Ringmaster Logs below for more information. |
| 7 | HDFS failure | When HOD fails to allocate due to HDFS failures (or JobTracker failures, error code 8, see below), it prints a failure message "Hodring at <hostname> failed with following errors:" and then gives the actual error message, which may indicate one of the following: 1. Problem in starting Hadoop clusters. Usually the actual cause in the error message indicates the problem on the hostname mentioned; also review the Hadoop related configuration in the HOD configuration files, and look at the Hadoop logs using information specified in the Collecting and Viewing Hadoop Logs section above. 2. Invalid configuration on the node running the hodring, specified by the hostname in the error message. 3. Invalid configuration in the hodring section of hodrc; ssh to the hostname specified in the error message and grep for ERROR or CRITICAL in the hodring logs. Refer to the section Locating Hodring Logs below for more information. 4. Invalid tarball specified which is not packaged correctly. 5. Cannot communicate with an externally configured HDFS. When such HDFS or JobTracker failures occur, one can log into the host with the hostname mentioned in the HOD failure message and debug the problem. While fixing the problem, one should also review other log messages in the ringmaster log to see which other machines might have had problems bringing up the JobTracker/NameNode, apart from the hostname reported in the failure message. This can happen because HOD continues to try and launch Hadoop daemons on multiple machines one after another, depending upon the value of the configuration variable ringmaster.max-master-failures. See Locating Ringmaster Logs for more information. |
| 8 | Job tracker failure | Similar to the causes in the HDFS failure case. |
| 10 | Cluster dead | 1. Cluster was auto-deallocated because it was idle for a long time. 2. Cluster was auto-deallocated because the wallclock time specified by the system administrator or user was exceeded. 3. Cannot communicate with the JobTracker and HDFS NameNode which were successfully allocated. Deallocate the cluster, and allocate again. |
| 12 | Cluster already allocated | The cluster directory specified has been used in a previous allocate operation and is not yet deallocated. Specify a different directory, or deallocate the previous allocation first. |
| 13 | HDFS dead | Cannot communicate with the HDFS NameNode. The HDFS NameNode went down. |
| 14 | Mapred dead | 1. Cluster was auto-deallocated because it was idle for a long time. 2. Cluster was auto-deallocated because the wallclock time specified by the system administrator or user was exceeded. 3. Cannot communicate with the MapReduce JobTracker. The JobTracker node went down. |
| 15 | Cluster not allocated | An operation which requires an allocated cluster is given a cluster directory with no state information. |
| Any non-zero exit code | HOD script error | If the hod script option was used, it is likely that the exit code is from the script. Unfortunately, this could clash with the exit codes of the hod command itself. In order to help users differentiate these two, hod writes the script's exit code to a file called script.exitcode in the cluster directory, if the script returned an exit code. You can cat this file to determine the script's exit code. If it does not exist, then it is a hod command exit code. |
Hadoop DFSClient Warns with a NotReplicatedYetException

Sometimes, when you try to upload a file to the HDFS immediately after allocating a HOD cluster, DFSClient warns with a NotReplicatedYetException. It usually shows a message something like:
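The warning looks something like the following; the file name shown is illustrative:

```
org.apache.hadoop.ipc.RemoteException: java.io.IOException:
File /user/foo/myfile could only be replicated to 0 nodes, instead of 1
```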
This scenario arises when you try to upload a file to the HDFS while the DataNodes are still in the process of contacting the NameNode. This can be resolved by waiting for some time before uploading a new file to the HDFS, so that enough DataNodes start and contact the NameNode.
Hadoop Jobs Not Running on a Successfully Allocated Cluster

This scenario generally occurs when a cluster is allocated, is left inactive for some time, and then hadoop jobs are attempted to be run on it. The Hadoop jobs then fail with the following exception:
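The failure typically appears as repeated connection retries of this kind; the hostname, IP and port are illustrative:

```
08/01/25 16:31:40 INFO ipc.Client: Retrying connect to server: foo.bar.com/1.1.1.1:53567. Already tried 1 time(s).
```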
Possible Cause: No Hadoop jobs were run for a significant portion of time. Thus the cluster would have got deallocated as described in the section Auto-deallocation of Idle Clusters. Deallocate the cluster and allocate it again.
Possible Cause: The wallclock limit specified by the Torque administrator or the -l option defined in the section Specifying Additional Job Attributes was exceeded since allocation time. Thus the cluster would have got released. Deallocate the cluster and allocate it again.
Possible Cause: There is a version mismatch between the version of the hadoop being used in provisioning (typically via the tarball option) and the external HDFS. Ensure compatible versions are being used.
Possible Cause: There is a version mismatch between the version of the hadoop client being used to submit jobs and the hadoop used in provisioning (typically via the tarball option). Ensure compatible versions are being used.
Possible Cause: You used one of the options for specifying Hadoop configuration, -M or -H, which had special characters like space or comma that were not escaped correctly. Refer to the section Options Configuring HOD to check how to specify such options correctly.
My Hadoop Job Got Killed

Possible Cause: The wallclock limit specified by the Torque administrator or the -l option defined in the section Specifying Additional Job Attributes was exceeded since allocation time. Thus the cluster would have got released. Deallocate the cluster and allocate it again, this time with a larger wallclock time.
Possible Cause: Problems with the JobTracker node. Refer to the section Collecting and Viewing Hadoop Logs to get more information.
Hadoop Job Fails with Message: 'Job tracker still initializing'

Possible Cause: The hadoop job was being run as part of the HOD script command, and it started before the JobTracker could come up fully. Allocate the cluster using a large value for the configuration option --hod.script-wait-time. Typically a value of 120 should work, though it is usually unnecessary to be that large.
The Exit Codes for HOD Are Not Getting Into Torque

Possible Cause: Version 0.16 of hadoop is required for this functionality to work. The version of Hadoop used does not match. Use the required version of Hadoop.
Possible Cause: The deallocation was done without using the hod command, e.g. directly using qdel. When the cluster is deallocated in this manner, the HOD processes are terminated using signals. This results in the exit code being based on the signal number, rather than the exit code of the program.
The Hadoop Logs Are Not Uploaded to HDFS

Possible Cause: There is a version mismatch between the version of the hadoop being used for uploading the logs and the external HDFS. Ensure that the correct version is specified in the hodring.pkgs option.
Locating Ringmaster Logs

To locate the ringmaster logs, follow these steps:

- Execute qstat -f torque_job_id and look up the value of the exec_host parameter in the output. The first host in this list is the ringmaster node.
- Log into this node.
- The ringmaster log location is specified by the ringmaster.log-dir option in the hodrc. The name of the log file will be username.torque_job_id/ringmaster-main.log.
- If you don't get enough information, you can increase the ringmaster debug level by passing --ringmaster.debug 4 to the hod command line.

Locating Hodring Logs

To locate hodring logs, follow the steps below:
- Execute qstat -f torque_job_id and look up the value of the exec_host parameter in the output. All nodes in this list should have a hodring on them.
- Log into any of these nodes.
- The hodring log location is specified by the hodring.log-dir option in the hodrc. The name of the log file will be username.torque_job_id/hodring-main.log.
- If you don't get enough information, you can increase the hodring debug level by passing --hodring.debug 4 to the hod command line.

This section shows administrators how to install, configure and run HOD.
The basic system architecture of HOD includes these components:

- A resource manager (possibly together with a scheduler)
- Various HOD components
- Hadoop MapReduce and HDFS daemons
HOD provisions and maintains Hadoop MapReduce and, optionally, HDFS instances through interaction with the above components on a given cluster of nodes. A cluster of nodes can be thought of as comprising two sets of nodes:

- Submit nodes: Users use the HOD client on these nodes to allocate clusters, and then use the Hadoop client to submit Hadoop jobs.
- Compute nodes: Using the resource manager, HOD components run on these nodes to provision the Hadoop daemons. After that, Hadoop jobs run on them.
Here is a brief description of the sequence of operations in allocating a cluster and running jobs on it.
To use HOD, your system should include the following components.
Note: HOD configuration requires the location of installs of these components to be the same on all nodes in the cluster. It will also make the configuration simpler to have the same location on the submit nodes.
Currently HOD works with the Torque resource manager, which it uses for its node allocation and job submission. Torque is an open source resource manager from Cluster Resources, a community effort based on the PBS project. It provides control over batch jobs and distributed compute nodes. Torque is freely available for download from here.
All documentation related to Torque can be seen under the section TORQUE Resource Manager here. You can get wiki documentation from here. Users may wish to subscribe to TORQUE's mailing list or view the archive for questions and comments here.
To use HOD with Torque:

- Specify a cluster name as a property for all nodes in the cluster, using the qmgr command, for example: qmgr -c "set node node properties=cluster-name". The name of the cluster is the same as the HOD configuration parameter, hod.cluster.
- Make sure that jobs can be submitted to the nodes; for example: echo "sleep 30" | qsub -l nodes=3
Once the resource manager is set up, you can obtain and install HOD.
You can configure HOD once it is installed. The minimal configuration needed to run HOD is described below. More advanced configuration options are discussed in HOD Configuration.
To get started using HOD, the following minimal configuration is required:
Specify values suitable to your environment for the following variables defined in the configuration file. Note that some of these variables are defined at more than one place in the file.
The following environment variables may need to be set depending on your environment. These variables must be defined where you run the HOD client and must also be specified in the HOD configuration file as the value of the key resource_manager.env-vars. Multiple variables can be specified as a comma separated list of key=value pairs.
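For example, a sketch of a hodrc entry of this kind; the variable and path are illustrative:

```
resource_manager.env-vars = HOD_PYTHON_HOME=/usr/local/bin/python
```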
You can review and modify other configuration options to suit your specific needs. See HOD Configuration for more information.
You can run HOD once it is configured. Refer to HOD Users for more information.
This section describes supporting tools and utilities that can be used to manage HOD deployments.
As mentioned under Collecting and Viewing Hadoop Logs, HOD can be configured to upload Hadoop logs to a statically configured HDFS. Over time, the number of logs uploaded to HDFS could increase. logcondense.py is a tool that helps administrators to remove log files uploaded to HDFS.
logcondense.py is available under the hod_install_location/support folder. You can either run it using python, for example, python logcondense.py, or give execute permissions to the file and directly run it as logcondense.py. If permissions are enabled in HDFS, logcondense.py needs to be run by a user who has sufficient permissions to remove files from the locations where log files are uploaded. For example, as mentioned under hodring options, the logs could be configured to come under the user's home directory in HDFS. In that case, the user running logcondense.py should have superuser privileges to remove the files from under all user home directories.
The following command line options are supported for logcondense.py.
| Short Option | Long Option | Meaning | Example |
|---|---|---|---|
| -p | --package | Complete path to the hadoop script. The version of hadoop must be the same as the one running HDFS. | /usr/bin/hadoop |
| -d | --days | Delete log files older than the specified number of days | 7 |
| -c | --config | Path to the Hadoop configuration directory, under which hadoop-site.xml resides. The hadoop-site.xml must point to the HDFS NameNode from which logs are to be removed. | /home/foo/hadoop/conf |
| -l | --logs | An HDFS path; this must be the same HDFS path as specified for the log-destination-uri, as mentioned under hodring options, without the hdfs:// URI string | /user |
| -n | --dynamicdfs | If true, this indicates that the logcondense.py script should delete HDFS logs in addition to MapReduce logs. Otherwise, it only deletes MapReduce logs, which is also the default if this option is not specified. This option is useful if dynamic HDFS installations are being provisioned by HOD, and the static HDFS installation is being used only to collect logs - a scenario that may be common in test clusters. | false |
| -r | --retain-master-logs | If true, this will keep the JobTracker logs of the job in hod-logs inside HDFS and delete only the TaskTracker logs. It will also keep the NameNode logs along with the JobTracker logs, deleting only the DataNode logs, if the 'dynamicdfs' option is set to true. Otherwise, it will delete the complete job directory from hod-logs inside HDFS. By default it is set to false. | false |
So, for example, to delete all log files older than 7 days using a hadoop-site.xml stored in ~/hadoop-conf, using the hadoop installation under ~/hadoop-0.17.0, you could say:
python logcondense.py -p ~/hadoop-0.17.0/bin/hadoop -d 7 -c ~/hadoop-conf -l /user
checklimits.sh is a HOD tool specific to the Torque/Maui environment (Maui Cluster Scheduler is an open source job scheduler for clusters and supercomputers, from clusterresources). The checklimits.sh script updates the Torque comment field when newly submitted job(s) violate or exceed user limits set up in the Maui scheduler. It uses qstat, does one pass over the Torque job list to determine queued or unfinished jobs, runs the Maui tool checkjob on each job to see if user limits are violated, and then runs Torque's qalter utility to update the job attribute 'comment'. Currently it updates the comment as User-limits exceeded. Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*) for those jobs that violate limits. This comment field is then used by HOD to behave accordingly depending on the type of violation.
checklimits.sh is available under the hod_install_location/support folder. This shell script can be run directly as sh checklimits.sh or as ./checklimits.sh after enabling execute permissions. Torque and Maui binaries should be available on the machine where the tool is run and should be in the path of the shell script process. To update the comment field of jobs from different users, this tool must be run with Torque administrative privileges. This tool must be run repeatedly after specific intervals of time to frequently update jobs violating constraints, for example via cron. Please note that the resource manager and scheduler commands used in this script can be expensive and so it is better not to run this inside a tight loop without sleeping.
Production systems use accounting packages to charge users for using shared compute resources. HOD supports a parameter resource_manager.pbs-account to allow users to identify the account under which they would like to submit jobs. It may be necessary to verify that this account is a valid one configured in an accounting system. The hod-install-dir/bin/verify-account script provides a mechanism to plug in a custom script that can do this verification.
HOD runs the verify-account script, passing in the resource_manager.pbs-account value as an argument to the script, before allocating a cluster. Sites can write a script that verifies this account against their accounting systems. Returning a non-zero exit code from this script will cause HOD to fail allocation. Also, in case of an error, HOD will print the output of the script to the user. Any descriptive error message can be passed to the user from the script in this manner.
The default script that comes with the HOD installation does not do any validation, and returns a zero exit code.
If the verify-account script is not found, then HOD will treat account verification as disabled and continue with allocation as is.
This section discusses how to work with the HOD configuration options.
Configuration options can be specified in two ways: as a configuration file in the INI format, and as command line options to the HOD shell, specified in the format --section.option[=value]. If the same option is specified in both places, the value specified on the command line overrides the value in the configuration file.
To get a simple description of all configuration options use:
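Per the command line discussion above:

```
$ hod --verbose-help
```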
HOD organizes configuration options into these sections:

- hod: Options for the HOD client
- resource_manager: Options for specifying the resource manager to use, and other parameters for using that resource manager
- ringmaster: Options for the RingMaster process
- hodring: Options for the HodRing processes
- gridservice-mapred: Options for the MapReduce instance
- gridservice-hdfs: Options for the HDFS instance