diff --git a/hadoop-common-project/hadoop-common/CHANGES.txt b/hadoop-common-project/hadoop-common/CHANGES.txt index e37b87cf34..87b551703c 100644 --- a/hadoop-common-project/hadoop-common/CHANGES.txt +++ b/hadoop-common-project/hadoop-common/CHANGES.txt @@ -363,7 +363,19 @@ Release 2.3.0 - UNRELEASED HADOOP-9908. Fix NPE when versioninfo properties file is missing (todd) -Release 2.1.1-beta - UNRELEASED +Release 2.2.0 - UNRELEASED + + INCOMPATIBLE CHANGES + + NEW FEATURES + + IMPROVEMENTS + + OPTIMIZATIONS + + BUG FIXES + +Release 2.1.1-beta - 2013-09-23 INCOMPATIBLE CHANGES diff --git a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html index 0f52e1ae49..2ef6c3608c 100644 --- a/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html +++ b/hadoop-common-project/hadoop-common/src/main/docs/releasenotes.html @@ -1,4 +1,1132 @@ +
Running TestContainerLogsPage on trunk while Native IO is enabled makes it fail
The {{appFinished}} method is not being called when applications have finished. This causes a couple of leaks as {{oldMasterKeys}} and {{appToAppAttemptMap}} are never being pruned. +
preemption is enabled. +Queue = a,b +a capacity = 30% +b capacity = 70% + +Step 1: Assign a big job to queue a (so that job_a will utilize some resources from queue b) +Step 2: Assign a big job to queue b. + +The following exception is thrown at the Resource Manager: +{noformat} +2013-09-12 10:42:32,535 ERROR [SchedulingMonitor (ProportionalCapacityPreemptionPolicy)] yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(68)) - Thread Thread[SchedulingMonitor (ProportionalCapacityPreemptionPolicy),5,main] threw an Exception. +java.lang.ClassCastException: java.util.Collections$UnmodifiableSet cannot be cast to java.util.NavigableSet + at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.getContainersToPreempt(ProportionalCapacityPreemptionPolicy.java:403) + at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:202) + at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:173) + at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:72) + at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PreemptionChecker.run(SchedulingMonitor.java:82) + at java.lang.Thread.run(Thread.java:662) + +{noformat}
In the web services API for cluster/metrics, the totalNodes reported doesn't include the unhealthy nodes. + +this.totalNodes = activeNodes + lostNodes + decommissionedNodes + + rebootedNodes;
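A minimal sketch of the intended fix, assuming the unhealthy count is simply added to the sum (the class and method names here are hypothetical, not the actual ClusterMetricsInfo code):

```java
// Hypothetical sketch of the corrected totalNodes computation; the field
// names mirror the snippet above, but this is not the actual YARN source.
public class ClusterMetricsSketch {
    public static int totalNodes(int activeNodes, int lostNodes,
            int decommissionedNodes, int rebootedNodes, int unhealthyNodes) {
        // Include unhealthyNodes, which the current code omits.
        return activeNodes + lostNodes + decommissionedNodes
                + rebootedNodes + unhealthyNodes;
    }
}
```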
yarn proto definitions should specify package as 'hadoop.yarn' similar to protos with 'hadoop.common' & 'hadoop.hdfs' in Common & HDFS respectively.
On a secure cluster, an invalid key to HMAC error is thrown when trying to get an application report for an application with an attempt that has unregistered.
Unmanaged AMs do not run in the cluster, so their tracking URL should not be proxied.
Currently container-executor.c has a banned set of users (mapred, hdfs & bin) and a configurable min.user.id (defaulting to 1000). + +This presents a problem for systems that run as system users (below 1000) if those systems want to start containers. + +Systems like Impala fit in this category. A (local) 'impala' system user is created when installing Impala on the nodes. + +Note that the same thing happens when installing systems like HDFS, YARN, and Oozie from packages (Bigtop); local system users are created. + +For Impala to be able to run containers in a secure cluster, the 'impala' system user must be whitelisted. + +To support this, an 'allowed.system.users' option can be added to container-executor.cfg, with logic in container-executor.c to allow the usernames in that list. + +Because system users are not guaranteed to have the same UID on different machines, the 'allowed.system.users' property should use usernames and not UIDs. +
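A sketch of what the proposed container-executor.cfg entry could look like (the surrounding keys exist today; the 'allowed.system.users' line is the proposal above, and the values are purely illustrative):

```properties
yarn.nodemanager.linux-container-executor.group=yarn
banned.users=mapred,hdfs,bin
min.user.id=1000
# Proposed: system users allowed to run containers despite having a UID below min.user.id
allowed.system.users=impala
```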
Today we only list applications in the RUNNING state by default for "yarn application -list". Instead, we should show all applications that are submitted, accepted, or running.
In YARN-557, we added some code to make {{ApplicationConstants.Environment.USER}} have an OS-specific definition in order to fix the unit test TestUnmanagedAMLauncher. In YARN-571, the relevant test code was corrected. In YARN-602, we will actually explicitly set the environment variables for the child containers. With these changes, I think we can revert the YARN-557 change to make {{ApplicationConstants.Environment.USER}} OS-neutral. The main benefit is that we can use the same method over the Enum constants. This should also fix the TestContainerLaunch#testContainerEnvVariables failure on Windows.
The help message was standardized in YARN-1080. It would be nice to have similar changes for $ yarn application and $ yarn node.
The AMRMTokens are now only saved in the RMStateStore and not populated back to the AMRMTokenSecretManager after the RM restarts. This is needed more now that AMRMTokens are also used in non-secure environments.
If a secure RM with recovery enabled is restarted while Oozie jobs are running, the RM fails to come up.
The issue is in RMNodeImpl, where both the RUNNING and UNHEALTHY states transition to a deactivated state (LOST, DECOMMISSIONED, REBOOTED) using the same DeactivateNodeTransition class. The DeactivateNodeTransition class naturally decrements the active node count; however, in cases where the node has transitioned to UNHEALTHY, the active count has already been decremented.
Enable the RM restart feature and restart the Resource Manager while a job is running. + +The Resource Manager fails to start with the error below: + +2013-08-23 17:57:40,705 INFO resourcemanager.RMAppManager (RMAppManager.java:recover(370)) - Recovering application application_1377280618693_0001 +2013-08-23 17:57:40,763 ERROR resourcemanager.ResourceManager (ResourceManager.java:serviceStart(617)) - Failed to load/recover state +java.lang.NullPointerException + at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.setTimerForTokenRenewal(DelegationTokenRenewer.java:371) + at org.apache.hadoop.yarn.server.resourcemanager.security.DelegationTokenRenewer.addApplication(DelegationTokenRenewer.java:307) + at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:291) + at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:371) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:819) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:613) + at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:832) +2013-08-23 17:57:40,766 INFO util.ExitUtil (ExitUtil.java:terminate(124)) - Exiting with status 1 + + +
The fair scheduler is still evolving, but the current documentation contains some inaccuracies. + + +
In a Kerberos setup, an HTTP client is expected to authenticate to Kerberos before the user is allowed to browse any information.
If 'yarn.nm.liveness-monitor.expiry-interval-ms' is set to less than the heartbeat interval, all the node managers will be added to 'Lost Nodes'. + +Instead, the Resource Manager should validate these properties and fail to start if the combination is invalid.
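The proposed validation could be sketched roughly as follows (a hedged sketch: the class and method names are hypothetical, and the real fix would live in the RM's service-initialization path):

```java
// Hypothetical sketch: fail fast at RM startup when the NM liveness
// expiry interval is not larger than the node heartbeat interval.
public class LivenessConfigCheck {
    public static void validate(long expiryIntervalMs, long heartbeatIntervalMs) {
        if (expiryIntervalMs <= heartbeatIntervalMs) {
            throw new IllegalArgumentException(
                "yarn.nm.liveness-monitor.expiry-interval-ms ("
                + expiryIntervalMs + ") must be greater than the node"
                + " heartbeat interval (" + heartbeatIntervalMs + ")");
        }
    }
}
```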
The output of $ yarn node -list shows the number of running containers at each node. I found a case where a new user of YARN thought this was a container ID, used it later in other YARN commands, and hit an error due to the misunderstanding. + +{code:title=current output} +2013-07-31 04:00:37,814|beaver.machine|INFO|RUNNING: /usr/bin/yarn node -list +2013-07-31 04:00:38,746|beaver.machine|INFO|Total Nodes:1 +2013-07-31 04:00:38,747|beaver.machine|INFO|Node-Id Node-State Node-Http-Address Running-Containers +2013-07-31 04:00:38,747|beaver.machine|INFO|myhost:45454 RUNNING myhost:50060 2 +{code} + +{code:title=proposed output} +2013-07-31 04:00:37,814|beaver.machine|INFO|RUNNING: /usr/bin/yarn node -list +2013-07-31 04:00:38,746|beaver.machine|INFO|Total Nodes:1 +2013-07-31 04:00:38,747|beaver.machine|INFO|Node-Id Node-State Node-Http-Address Number-of-Running-Containers +2013-07-31 04:00:38,747|beaver.machine|INFO|myhost:45454 RUNNING myhost:50060 2 +{code}
There are 2 parts I am proposing in this JIRA. They can be fixed together in one patch. + +1. Standardize the help message for the required parameter of $ yarn logs +The YARN CLI has a command "logs" ($ yarn logs). The command always requires a "-applicationId <arg>" parameter. However, the help message of the command does not make this clear; it lists -applicationId as an optional parameter, yet if I don't set it, the YARN CLI complains that it is missing. It would be better to use the standard required-parameter notation used by other Linux commands, so that any user familiar with that convention can understand more easily that this parameter is needed. + +{code:title=current help message} +-bash-4.1$ yarn logs +usage: general options are: + -applicationId <arg> ApplicationId (required) + -appOwner <arg> AppOwner (assumed to be current user if not + specified) + -containerId <arg> ContainerId (must be specified if node address is + specified) + -nodeAddress <arg> NodeAddress in the format nodename:port (must be + specified if container id is specified) +{code} + +{code:title=proposed help message} +-bash-4.1$ yarn logs +usage: yarn logs -applicationId <application ID> [OPTIONS] +general options are: + -appOwner <arg> AppOwner (assumed to be current user if not + specified) + -containerId <arg> ContainerId (must be specified if node address is + specified) + -nodeAddress <arg> NodeAddress in the format nodename:port (must be + specified if container id is specified) +{code} + +2. Add a description to the help output. As far as I know, a user cannot get logs for a running job. Since I spent some time trying to get logs of running applications, it would be nice to say this in the command description. +{code:title=proposed help} +Retrieve logs for completed/killed YARN application +usage: general options are... +{code} +
The three unit tests fail on Windows due to host name resolution differences on Windows, i.e. 127.0.0.1 does not resolve to host name "localhost". + +{noformat} +org.apache.hadoop.security.token.SecretManager$InvalidToken: Given Container container_0_0000_01_000000 identifier is not valid for current Node manager. Expected : 127.0.0.1:12345 Found : localhost:12345 +{noformat} + +{noformat} +testNMConnectionToRM(org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater) Time elapsed: 8343 sec <<< FAILURE! +org.junit.ComparisonFailure: expected:<[localhost]:12345> but was:<[127.0.0.1]:12345> + at org.junit.Assert.assertEquals(Assert.java:125) + at org.junit.Assert.assertEquals(Assert.java:147) + at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyResourceTracker6.registerNodeManager(TestNodeStatusUpdater.java:712) + at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) + at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) + at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) + at java.lang.reflect.Method.invoke(Method.java:597) + at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) + at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:101) + at $Proxy26.registerNodeManager(Unknown Source) + at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:212) + at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:149) + at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater$MyNodeStatusUpdater4.serviceStart(TestNodeStatusUpdater.java:369) + at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) + at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:101) + at 
org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:213) + at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) + at org.apache.hadoop.yarn.server.nodemanager.TestNodeStatusUpdater.testNMConnectionToRM(TestNodeStatusUpdater.java:985) +{noformat}
Several cases in this unit test fail on Windows. (Error log appended at the end.) + +testInvalidEnvSyntaxDiagnostics fails because of the difference between cmd and bash script error handling. If some command fails in a cmd script, cmd will continue executing the rest of the script commands; error handling needs to be explicitly carried out in the script file. The error code of the last command is returned as the error code of the whole script. In this test, an error happens in the middle of the cmd script, and the test expects an exception and a non-zero error code. In the cmd script, the intermediate errors are ignored, the last "call" command succeeds, and there is no exception. + +testContainerLaunchStdoutAndStderrDiagnostics fails due to wrong cmd commands used by the test. + +testContainerEnvVariables and testDelayedKill fail due to a regression from YARN-906. + +{noformat} +------------------------------------------------------------------------------- +Test set: org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch +------------------------------------------------------------------------------- +Tests run: 7, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 11.526 sec <<< FAILURE! +testInvalidEnvSyntaxDiagnostics(org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch) Time elapsed: 583 sec <<< FAILURE! +junit.framework.AssertionFailedError: Should catch exception + at junit.framework.Assert.fail(Assert.java:50) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch.testInvalidEnvSyntaxDiagnostics(TestContainerLaunch.java:269) +... + +testContainerLaunchStdoutAndStderrDiagnostics(org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch) Time elapsed: 561 sec <<< FAILURE! 
+junit.framework.AssertionFailedError: Should catch exception + at junit.framework.Assert.fail(Assert.java:50) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch.testContainerLaunchStdoutAndStderrDiagnostics(TestContainerLaunch.java:314) +... + +testContainerEnvVariables(org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch) Time elapsed: 4136 sec <<< FAILURE! +junit.framework.AssertionFailedError: expected:<137> but was:<143> + at junit.framework.Assert.fail(Assert.java:50) + at junit.framework.Assert.failNotEquals(Assert.java:287) + at junit.framework.Assert.assertEquals(Assert.java:67) + at junit.framework.Assert.assertEquals(Assert.java:199) + at junit.framework.Assert.assertEquals(Assert.java:205) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch.testContainerEnvVariables(TestContainerLaunch.java:500) +... + +testDelayedKill(org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch) Time elapsed: 2744 sec <<< FAILURE! +junit.framework.AssertionFailedError: expected:<137> but was:<143> + at junit.framework.Assert.fail(Assert.java:50) + at junit.framework.Assert.failNotEquals(Assert.java:287) + at junit.framework.Assert.assertEquals(Assert.java:67) + at junit.framework.Assert.assertEquals(Assert.java:199) + at junit.framework.Assert.assertEquals(Assert.java:205) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.TestContainerLaunch.testDelayedKill(TestContainerLaunch.java:601) +... +{noformat} +
Once a user brings up the YARN daemon and runs jobs, the jobs will stay in the output returned by $ yarn application -list even after they have completed. We want the YARN command line to clean up this list. Specifically, we want to remove applications in the FINISHED state (not Final-State) or KILLED state from the result. + +{code} +[user1@host1 ~]$ yarn application -list +Total Applications:150 + Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL +application_1374638600275_0109 Sleep job MAPREDUCE user1 default KILLED KILLED 100% host1:54059 +application_1374638600275_0121 Sleep job MAPREDUCE user1 default FINISHED SUCCEEDED 100% host1:19888/jobhistory/job/job_1374638600275_0121 +application_1374638600275_0020 Sleep job MAPREDUCE user1 default FINISHED SUCCEEDED 100% host1:19888/jobhistory/job/job_1374638600275_0020 +application_1374638600275_0038 Sleep job MAPREDUCE user1 default +.... +{code} +
With the current behavior it is impossible to determine whether a container has been preempted or lost due to a NM crash. + +Adding a PREEMPTED exit status (-102) will help an AM determine that a container has been preempted. + +Note the change of scope from the original summary/description. The original scope proposed API/behavior changes. Because we are past 2.1.0-beta, I'm reducing the scope of this JIRA.
The YARN Fair Scheduler is largely stable now, and should no longer be declared experimental.
ResourceManager and NodeManager do not have the correct setting for java.library.path when launched on Windows. This prevents the processes from loading native code from hadoop.dll. The native code is required for correct functioning on Windows (not optional), so this ultimately can cause failures.
While the NMs are keyed using the NodeId, the allocation is done based on the hostname. + +This makes different nodes on the same host indistinguishable to the scheduler. + +There should be an option to enable allocations based on host:port instead of just the host. The nodes reported to the AM should report the 'key' (host or host:port). +
The nodes web page, which lists all the connected nodes of the cluster, is broken: + +1. The page is not shown in the correct format/style. +2. If we restart an NM, the node list is not refreshed; the newly started NM is just added to the list, and the old NM's information still remains.
In Ambari we plan to show, for MR2, the number of applications finished, running, waiting, etc. It would be efficient if YARN could provide aggregated counts per application-type and state. +
YARN-654 performs sanity checks for parameters of public methods in AMRMClient. Those may throw runtime exceptions. +Currently, the heartbeat thread in AMRMClientAsync only catches IOException and YarnException, and will not handle RuntimeExceptions properly. +A possible solution: the heartbeat thread catches Throwable and notifies the callback handler thread via the existing savedException.
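The proposed solution could look roughly like this (a sketch, not the actual AMRMClientAsync code; the field and method names are assumptions based on the description above):

```java
// Sketch of a heartbeat step that catches Throwable instead of only
// IOException/YarnException, handing the error to the callback-handler
// thread via a saved-exception slot (names are hypothetical).
public class HeartbeatSketch {
    private volatile Throwable savedException;

    public void heartbeatOnce(Runnable allocateCall) {
        try {
            allocateCall.run();
        } catch (Throwable t) {
            // Previously only IOException/YarnException were caught here,
            // so a RuntimeException silently killed the heartbeat thread.
            savedException = t;
        }
    }

    public Throwable getSavedException() {
        return savedException;
    }
}
```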
In ContainerImpl.getLocalizedResources(), there's: +{code} +assert ContainerState.LOCALIZED == getContainerState(); // TODO: FIXME!! +{code} + +ContainerImpl.getLocalizedResources() is called in ContainerLaunch.call(), which is scheduled on a separate thread. If the container is not at LOCALIZED (e.g. it is at KILLING, see YARN-906), an AssertionError will be thrown and will fail the thread without notifying the NM. As a result, the container cannot receive any more events, which are supposed to be sent from ContainerLaunch.call(), and cannot move towards completion.
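One way to avoid the thread-killing AssertionError is to replace the assert with an explicit state check that surfaces a normal exception the caller can handle (a simplified sketch; the enum, types, and return value are stand-ins, not the real ContainerImpl code):

```java
// Simplified sketch: return resources only in the LOCALIZED state, and
// otherwise throw an ordinary exception instead of an AssertionError so
// the caller (ContainerLaunch.call()) can react and notify the NM.
public class ContainerSketch {
    enum State { NEW, LOCALIZING, LOCALIZED, RUNNING, KILLING }

    private final State state;

    public ContainerSketch(State state) {
        this.state = state;
    }

    public String getLocalizedResources() {
        if (state != State.LOCALIZED) {
            throw new IllegalStateException(
                "Container is at " + state + ", not LOCALIZED");
        }
        return "resources";
    }
}
```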
I have 2 node managers: +* one with 1024 MB of memory (nm1) +* a second with 2048 MB of memory (nm2) +I am submitting a simple MapReduce application with 1 mapper and 1 reducer, each requesting 1024 MB. The steps to reproduce this are: +* Stop nm2 (2048 MB). (This is to make sure that this node's heartbeat doesn't reach the RM first.) +* Now submit the application. As soon as the RM receives the first node's (nm1) heartbeat, it will try to reserve memory for the AM container (2048 MB). However, nm1 has only 1024 MB of memory. +* Now start nm2 with 2048 MB of memory. + +It hangs forever... This exposes two potential issues: +* The scheduler should not try to reserve memory on a node manager that can never provide the requested memory. The current max capability of the node manager is 1024 MB, but 2048 MB is reserved on it anyway. +* Say 2048 MB is reserved on nm1 but nm2 comes back with 2048 MB of available memory. In this case, if the original request was made without any locality, the scheduler should unreserve the memory on nm1 and allocate the requested 2048 MB container on nm2. +
At present we are blindly passing the allocate request containing containers to be released to the scheduler. This may result in one application releasing another application's container. + +{code} + @Override + @Lock(Lock.NoLock.class) + public Allocation allocate(ApplicationAttemptId applicationAttemptId, + List<ResourceRequest> ask, List<ContainerId> release, + List<String> blacklistAdditions, List<String> blacklistRemovals) { + + FiCaSchedulerApp application = getApplication(applicationAttemptId); +.... +.... + // Release containers + for (ContainerId releasedContainerId : release) { + RMContainer rmContainer = getRMContainer(releasedContainerId); + if (rmContainer == null) { + RMAuditLogger.logFailure(application.getUser(), + AuditConstants.RELEASE_CONTAINER, + "Unauthorized access or invalid container", "CapacityScheduler", + "Trying to release container not owned by app or with invalid id", + application.getApplicationId(), releasedContainerId); + } + completedContainer(rmContainer, + SchedulerUtils.createAbnormalContainerStatus( + releasedContainerId, + SchedulerUtils.RELEASED_CONTAINER), + RMContainerEventType.RELEASED); + } +{code} + +The current checks are not sufficient, and we should prevent this. Thoughts?
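The kind of ownership check being proposed can be sketched minimally as follows (hedged: the real scheduler types are replaced with strings and a map keyed by container id; this is not the actual CapacityScheduler code):

```java
import java.util.Map;

// Sketch of an ownership check before releasing a container: only honor
// the release if the container actually belongs to the requesting
// application attempt (types simplified to strings and a map).
public class ReleaseCheckSketch {
    public static boolean mayRelease(String requestingAttemptId,
            String containerId, Map<String, String> containerOwner) {
        String owner = containerOwner.get(containerId);
        // Reject unknown containers and containers owned by another app.
        return owner != null && owner.equals(requestingAttemptId);
    }
}
```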
locality.threshold.node and locality.threshold.rack should have the yarn.scheduler.fair prefix like the items before them + +http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
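Based on the prefix pattern of the surrounding properties, the documentation presumably should read (a hedged sketch of the corrected names, not verified against the fixed docs):

```properties
yarn.scheduler.fair.locality.threshold.node
yarn.scheduler.fair.locality.threshold.rack
```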
Making container start and completion events available to auxiliary services would allow them to be resource-aware. The auxiliary service would be able to notify a co-located service that is opportunistically using free capacity of allocation changes.
See https://builds.apache.org/job/PreCommit-YARN-Build/1435//testReport/org.apache.hadoop.yarn.client.api.impl/TestNMClient/testNMClientNoCleanupOnStop/
I have tried running DistributedShell and also used its ApplicationMaster for my test. +The application runs successfully, though it logs some errors which would be useful to fix. +Below are the logs from the NodeManager and the ApplicationMaster node + +Log Snippet for NodeManager +============================= +2013-07-07 13:39:18,787 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Connecting to ResourceManager at localhost/127.0.0.1:9990. current no. of attempts is 1 +2013-07-07 13:39:19,050 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -325382586 +2013-07-07 13:39:19,052 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for nm-tokens, got key with id :1005046570 +2013-07-07 13:39:19,053 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as sunny-Inspiron:9993 with total resource of <memory:10240, vCores:8> +2013-07-07 13:39:19,053 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests +2013-07-07 13:39:35,256 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1373184544832_0001_000001 (auth:SIMPLE) +2013-07-07 13:39:35,492 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1373184544832_0001_01_000001 by user sunny +2013-07-07 13:39:35,507 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1373184544832_0001 +2013-07-07 13:39:35,511 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny IP=127.0.0.1 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1373184544832_0001 CONTAINERID=container_1373184544832_0001_01_000001 
+2013-07-07 13:39:35,511 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1373184544832_0001 transitioned from NEW to INITING +2013-07-07 13:39:35,512 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1373184544832_0001_01_000001 to application application_1373184544832_0001 +2013-07-07 13:39:35,518 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1373184544832_0001 transitioned from INITING to RUNNING +2013-07-07 13:39:35,528 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000001 transitioned from NEW to LOCALIZING +2013-07-07 13:39:35,540 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://localhost:9000/application/test.jar transitioned from INIT to DOWNLOADING +2013-07-07 13:39:35,540 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1373184544832_0001_01_000001 +2013-07-07 13:39:35,675 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Writing credentials to the nmPrivate file /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001_01_000001.tokens. 
Credentials list: +2013-07-07 13:39:35,694 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Initializing user sunny +2013-07-07 13:39:35,803 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Copying from /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/nmPrivate/container_1373184544832_0001_01_000001.tokens to /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/usercache/sunny/appcache/application_1373184544832_0001/container_1373184544832_0001_01_000001.tokens +2013-07-07 13:39:35,803 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: CWD set to /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/usercache/sunny/appcache/application_1373184544832_0001 = file:/home/sunny/Hadoop2/hadoopdata/nodemanagerdata/usercache/sunny/appcache/application_1373184544832_0001 +2013-07-07 13:39:36,136 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:36,406 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://localhost:9000/application/test.jar transitioned from DOWNLOADING to LOCALIZED +2013-07-07 13:39:36,409 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000001 transitioned from LOCALIZING to LOCALIZED +2013-07-07 13:39:36,524 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000001 transitioned from LOCALIZED to RUNNING +2013-07-07 13:39:36,692 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, -c, 
/home/sunny/Hadoop2/hadoopdata/nodemanagerdata/usercache/sunny/appcache/application_1373184544832_0001/container_1373184544832_0001_01_000001/default_container_executor.sh] +2013-07-07 13:39:37,144 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:38,147 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:39,151 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:39,209 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1373184544832_0001_01_000001 +2013-07-07 13:39:39,259 WARN org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: Unexpected: procfs stat file is not in the expected format for process with pid 11552 +2013-07-07 13:39:39,264 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 29524 for container-id container_1373184544832_0001_01_000001: 79.9 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used +2013-07-07 13:39:39,645 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1373184544832_0001_000001 (auth:SIMPLE) +2013-07-07 13:39:39,651 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: 
Start request for container_1373184544832_0001_01_000002 by user sunny +2013-07-07 13:39:39,651 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny IP=127.0.0.1 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1373184544832_0001 CONTAINERID=container_1373184544832_0001_01_000002 +2013-07-07 13:39:39,651 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1373184544832_0001_01_000002 to application application_1373184544832_0001 +2013-07-07 13:39:39,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000002 transitioned from NEW to LOCALIZED +2013-07-07 13:39:39,660 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1373184544832_0001_01_000002 +2013-07-07 13:39:39,661 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Returning container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 2, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:39,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000002 transitioned from LOCALIZED to RUNNING +2013-07-07 13:39:39,873 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, -c, /home/sunny/Hadoop2/hadoopdata/nodemanagerdata/usercache/sunny/appcache/application_1373184544832_0001/container_1373184544832_0001_01_000002/default_container_executor.sh] +2013-07-07 13:39:39,898 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1373184544832_0001_01_000002 succeeded +2013-07-07 13:39:39,899 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000002 transitioned from RUNNING to EXITED_WITH_SUCCESS +2013-07-07 13:39:39,900 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1373184544832_0001_01_000002 +2013-07-07 13:39:39,942 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny OPERATION=Container Finished - Succeeded TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1373184544832_0001 CONTAINERID=container_1373184544832_0001_01_000002 +2013-07-07 13:39:39,943 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000002 transitioned from EXITED_WITH_SUCCESS to DONE +2013-07-07 13:39:39,944 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1373184544832_0001_01_000002 from application application_1373184544832_0001 +2013-07-07 13:39:40,155 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:40,157 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 2, }, state: C_COMPLETE, diagnostics: "", exit_status: 0, +2013-07-07 13:39:40,158 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1373184544832_0001_01_000002 +2013-07-07 13:39:40,683 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1373184544832_0001_01_000002 +2013-07-07 
13:39:40,686 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1373184544832_0001_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000002 is not handled by this NodeManager +2013-07-07 13:39:40,687 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9993, call org.apache.hadoop.yarn.api.ContainerManagementProtocolPB.stopContainer from 127.0.0.1:51085: error: org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000002 is not handled by this NodeManager +org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000002 is not handled by this NodeManager + at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeGetAndStopContainerRequest(ContainerManagerImpl.java:614) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainer(ContainerManagerImpl.java:538) + at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainer(ContainerManagementProtocolPBServiceImpl.java:88) + at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:85) + at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605) + at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) + at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1868) + at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1864) + at java.security.AccessController.doPrivileged(Native Method) + at javax.security.auth.Subject.doAs(Subject.java:396) + at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) + at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1862) +2013-07-07 
13:39:41,162 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_RUNNING, diagnostics: "", exit_status: -1000, +2013-07-07 13:39:41,691 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1373184544832_0001_01_000001 succeeded +2013-07-07 13:39:41,692 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000001 transitioned from RUNNING to EXITED_WITH_SUCCESS +2013-07-07 13:39:41,692 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1373184544832_0001_01_000001 +2013-07-07 13:39:41,714 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=sunny OPERATION=Container Finished - Succeeded TARGET=ContainerImpl RESULT=SUCCESS APPID=application_1373184544832_0001 CONTAINERID=container_1373184544832_0001_01_000001 +2013-07-07 13:39:41,714 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1373184544832_0001_01_000001 transitioned from EXITED_WITH_SUCCESS to DONE +2013-07-07 13:39:41,714 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1373184544832_0001_01_000001 from application application_1373184544832_0001 +2013-07-07 13:39:42,166 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out status for container: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1373184544832, }, attemptId: 1, }, id: 1, }, state: C_COMPLETE, diagnostics: "", exit_status: 0, +2013-07-07 13:39:42,166 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed container container_1373184544832_0001_01_000001 
+2013-07-07 13:39:42,191 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1373184544832_0001_000001 (auth:SIMPLE) +2013-07-07 13:39:42,195 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Getting container-status for container_1373184544832_0001_01_000001 +2013-07-07 13:39:42,196 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:appattempt_1373184544832_0001_000001 (auth:TOKEN) cause:org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000001 is not handled by this NodeManager +2013-07-07 13:39:42,196 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9993, call org.apache.hadoop.yarn.api.ContainerManagementProtocolPB.stopContainer from 127.0.0.1:51086: error: org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000001 is not handled by this NodeManager +org.apache.hadoop.yarn.exceptions.YarnException: Container container_1373184544832_0001_01_000001 is not handled by this NodeManager + at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:45) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.authorizeGetAndStopContainerRequest(ContainerManagerImpl.java:614) + at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl.stopContainer(ContainerManagerImpl.java:538) + at org.apache.hadoop.yarn.api.impl.pb.service.ContainerManagementProtocolPBServiceImpl.stopContainer(ContainerManagementProtocolPBServiceImpl.java:88) + at org.apache.hadoop.yarn.proto.ContainerManagementProtocol$ContainerManagementProtocolService$2.callBlockingMethod(ContainerManagementProtocol.java:85) + at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:605) + at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1033) + at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1868) + at 
org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1864) + at java.security.AccessController.doPrivileged(Native Method) + at javax.security.auth.Subject.doAs(Subject.java:396) + at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1489) + at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1862) +2013-07-07 13:39:42,264 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1373184544832_0001_01_000002 +2013-07-07 13:39:42,265 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1373184544832_0001_01_000002 +2013-07-07 13:39:42,265 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1373184544832_0001_01_000001 +2013-07-07 13:39:43,173 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1373184544832_0001 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP +2013-07-07 13:39:43,174 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1373184544832_0001 +2013-07-07 13:39:43,180 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1373184544832_0001 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED +2013-07-07 13:39:43,180 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1373184544832_0001, with delay of 10800 seconds + + +Log Snippet for Application Manager +================================== +13/07/07 13:39:36 INFO client.SimpleApplicationMaster: Initializing ApplicationMaster +13/07/07 13:39:37 INFO client.SimpleApplicationMaster: Application master for app, 
appId=1, clustertimestamp=1373184544832, attemptId=1 +13/07/07 13:39:37 INFO client.SimpleApplicationMaster: Starting ApplicationMaster +13/07/07 13:39:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable +13/07/07 13:39:37 INFO impl.NMClientAsyncImpl: Upper bound of the thread pool size is 500 +13/07/07 13:39:37 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-nodemanagers-proxies : 500 +13/07/07 13:39:37 INFO client.SimpleApplicationMaster: Max mem capabililty of resources in this cluster 8192 +13/07/07 13:39:37 INFO client.SimpleApplicationMaster: Requested container ask: Capability[<memory:100, vCores:0>]Priority[0]ContainerCount[1] +13/07/07 13:39:39 INFO client.SimpleApplicationMaster: Got response from RM for container ask, allocatedCnt=1 +13/07/07 13:39:39 INFO client.SimpleApplicationMaster: Launching shell command on a new container., containerId=container_1373184544832_0001_01_000002, containerNode=sunny-Inspiron:9993, containerNodeURI=sunny-Inspiron:8042, containerResourceMemory1024 +13/07/07 13:39:39 INFO client.SimpleApplicationMaster: Setting up container launch container for containerid=container_1373184544832_0001_01_000002 +13/07/07 13:39:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1373184544832_0001_01_000002 +13/07/07 13:39:39 INFO impl.ContainerManagementProtocolProxy: Opening proxy : sunny-Inspiron:9993 +13/07/07 13:39:39 INFO client.SimpleApplicationMaster: Succeeded to start Container container_1373184544832_0001_01_000002 +13/07/07 13:39:39 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1373184544832_0001_01_000002 +13/07/07 13:39:40 INFO client.SimpleApplicationMaster: Got response from RM for container ask, completedCnt=1 +13/07/07 13:39:40 INFO client.SimpleApplicationMaster: Got container status for 
containerID=container_1373184544832_0001_01_000002, state=COMPLETE, exitStatus=0, diagnostics= +13/07/07 13:39:40 INFO client.SimpleApplicationMaster: Container completed successfully., containerId=container_1373184544832_0001_01_000002 +13/07/07 13:39:40 INFO client.SimpleApplicationMaster: Application completed. Stopping running containers +13/07/07 13:39:40 ERROR impl.NMClientImpl: Failed to stop Container container_1373184544832_0001_01_000002when stopping NMClientImpl +13/07/07 13:39:40 INFO impl.ContainerManagementProtocolProxy: Closing proxy : sunny-Inspiron:9993 +13/07/07 13:39:40 INFO client.SimpleApplicationMaster: Application completed. Signalling finish to RM +13/07/07 13:39:41 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue +java.lang.InterruptedException + at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1899) + at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1934) + at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399) + at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:281) +13/07/07 13:39:41 INFO client.SimpleApplicationMaster: Application Master completed successfully. exiting + + +
if a lower int value means higher priority, shouldn't we "return other.getPriority() - this.getPriority()"?
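A minimal sketch of the behavior in question, using a hypothetical Priority class (not the actual YARN one): with the reversed subtraction, a smaller int value compares as "greater", so a natural ascending sort places larger int values first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not the actual YARN class: demonstrates what
// "other.getPriority() - this.getPriority()" does to the natural ordering.
public class PriorityDemo {
    static class Priority implements Comparable<Priority> {
        final int priority;
        Priority(int p) { priority = p; }
        int getPriority() { return priority; }

        @Override
        public int compareTo(Priority other) {
            // Reversed subtraction: a smaller int (higher priority)
            // compares as "greater" under this ordering.
            return other.getPriority() - this.getPriority();
        }

        @Override
        public String toString() { return Integer.toString(priority); }
    }

    public static void main(String[] args) {
        List<Priority> ps = new ArrayList<>();
        ps.add(new Priority(3));
        ps.add(new Priority(1));
        ps.add(new Priority(2));
        Collections.sort(ps);
        // Ascending sort under the reversed subtraction yields
        // descending int order.
        System.out.println(ps);
    }
}
```

Which direction is "correct" depends on how the containing collection is iterated or polled; the sketch only shows that swapping the operands reverses the ordering.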
After YARN-750, AMRMClient should support blacklisting via the new YARN APIs.
YARN-757 got fixed by changing the scheduler from Fair to the default (which is Capacity).
If user info is present in the client token, then it can be used to do limited authorization in the AM.
Within the YARN Resource Manager REST API, the GET call that returns all applications can be filtered by a single state query parameter (http://<rm http address:port>/ws/v1/cluster/apps). + +There are 8 possible states (New, Submitted, Accepted, Running, Finishing, Finished, Failed, Killed). If no state parameter is specified, all states are returned; however, if a subset of states is required, multiple REST calls must be made (max. of 7). + +The proposal is to be able to specify multiple states in a single REST call.
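An illustrative sketch of the two query shapes. The single-state parameter exists today; the multi-state parameter (including its name, "states") is the proposal being described, not an existing API.

```java
// Sketch only: "states" is the proposed parameter, not an existing one.
public class AppsQueryDemo {
    // Existing API: one state per GET call, so a subset of states
    // requires up to 7 separate calls.
    static String singleState(String rmAddress, String state) {
        return "http://" + rmAddress + "/ws/v1/cluster/apps?state=" + state;
    }

    // Proposed: a comma-separated list collapses those calls into one.
    static String multiState(String rmAddress, String... states) {
        return "http://" + rmAddress + "/ws/v1/cluster/apps?states="
                + String.join(",", states);
    }

    public static void main(String[] args) {
        System.out.println(singleState("rm-host:8088", "RUNNING"));
        System.out.println(multiState("rm-host:8088", "RUNNING", "ACCEPTED", "SUBMITTED"));
    }
}
```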
This JIRA tracks why appToken and clientToAMToken are removed separately, and why they are removed in different transitions; ideally there would be a common place where these two tokens can be removed at the same time.
NodeManager should mandatorily set some environment variables into every container that it launches, such as Environment.user and Environment.pwd. If both the user and the NodeManager set those variables, the value set by the NM should be used.
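A minimal sketch of the precedence rule described above, with illustrative names (this is not the actual NM implementation): merge the user-supplied environment first, then overlay the NM's mandatory variables so they win on conflict.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of "NM-set values win" precedence when building a
// container's launch environment. Names are hypothetical.
public class EnvMergeDemo {
    static Map<String, String> merge(Map<String, String> userEnv,
                                     Map<String, String> nmEnv) {
        Map<String, String> env = new HashMap<>(userEnv);
        env.putAll(nmEnv); // NM's mandatory variables overwrite user values
        return env;
    }

    public static void main(String[] args) {
        Map<String, String> user = new HashMap<>();
        user.put("USER", "someoneelse");   // user tries to override USER
        user.put("MY_VAR", "x");           // user-only variable survives
        Map<String, String> nm = Map.of("USER", "sunny", "PWD", "/container/dir");
        System.out.println(merge(user, nm));
    }
}
```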
The fair scheduler should have an HTTP interface that exposes information such as applications per queue, fair shares, demands, and current allocations.
PublicLocalizer +1) pending is accessed by addResource (part of event handling) and by the run method (as part of PublicLocalizer.run()). + +PrivateLocalizer +1) pending is accessed by addResource (part of event handling) and by findNextResource (i.remove()). The update method should also be fixed; it too shares the pending list. +
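A sketch of the kind of fix implied above (class and method names are illustrative, not the actual localizer code): guard the shared pending list so the event-handling thread (addResource) and the localizer thread (run / findNextResource) cannot mutate it concurrently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch of synchronizing a shared "pending" list that is
// mutated from two threads. Not the actual NM localizer implementation.
public class LocalizerPendingDemo {
    private final List<String> pending =
            Collections.synchronizedList(new ArrayList<>());

    // Called from the event-handling thread.
    public void addResource(String rsrc) {
        pending.add(rsrc);
    }

    // Called from the localizer thread; iteration over a synchronizedList
    // still requires holding the list's lock explicitly.
    public String findNextResource() {
        synchronized (pending) {
            Iterator<String> i = pending.iterator();
            if (i.hasNext()) {
                String next = i.next();
                i.remove(); // safe: we hold the lock during iteration
                return next;
            }
            return null;
        }
    }
}
```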
When a job succeeds and successfully calls finishApplicationMaster, and the RM shuts down and restarts, the dispatcher is stopped before it can process the REMOVE_APP event. The next time the RM comes back, it will reload the existing state files even though the job succeeded.
While running some tests and adding/removing nodes, we saw the RM crash with the below exception. We are testing with the fair scheduler and running hadoop-2.0.3-alpha + +{noformat} +2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Deactivating Node YYYY:55680 as it is now LOST +2013-03-22 18:54:27,015 INFO org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: YYYY:55680 Node Transitioned from UNHEALTHY to LOST +2013-03-22 18:54:27,015 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_REMOVED to the scheduler +java.lang.NullPointerException + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeNode(FairScheduler.java:619) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:856) + at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:98) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:375) + at java.lang.Thread.run(Thread.java:662) +2013-03-22 18:54:27,016 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye.. +2013-03-22 18:54:27,020 INFO org.mortbay.log: Stopped SelectChannelConnector@XXXX:50030 +{noformat}
When the ResourceManager kills an application, it leaves the proxy URL redirecting to the original tracking URL for the application even though the ApplicationMaster is no longer there to service it. It should redirect somewhere more useful, like the RM's web page for the application, where the user can find that the application was killed and links to the AM logs. + +In addition, during teardown from the kill the AM can sometimes attempt to unregister and provide an updated tracking URL, but unfortunately the RM has "forgotten" the AM due to the kill and refuses to process the unregistration. Instead it logs: + +{noformat} +2013-01-09 17:37:49,671 [IPC Server handler 2 on 8030] ERROR +org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: AppAttemptId doesnt exist in cache appattempt_1357575694478_28614_000001 +{noformat} + +It should go ahead and process the unregistration and update the tracking URL, since the application offered it.
{code} +2012-12-26 08:41:15,030 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler: Calling allocate on removed or non existant application appattempt_1356385141279_49525_000001 +2012-12-26 08:41:15,031 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type CONTAINER_ALLOCATED for applicationAttempt application_1356385141279_49525 +java.lang.ArrayIndexOutOfBoundsException: 0 + at java.util.Arrays$ArrayList.get(Arrays.java:3381) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:655) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:644) + at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:357) + at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:298) + at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43) + at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:490) + at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:80) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:433) + at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:414) + at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:126) + at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:75) + at java.lang.Thread.run(Thread.java:662) + {code}
Used for testing when NameNode HA is enabled. Users can use a new configuration property "dfs.client.test.drop.namenode.response.number" to specify the number of responses that DFSClient will drop in each RPC call. This feature can help testing functionalities such as NameNode retry cache.
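The property name above is the one introduced by this change; the value shown here (drop one response per RPC call) is illustrative. A client-side hdfs-site.xml fragment might look like:

```xml
<!-- Test-only setting: number of NameNode responses DFSClient drops
     per RPC call. The value of 1 is an illustrative example. -->
<property>
  <name>dfs.client.test.drop.namenode.response.number</name>
  <value>1</value>
</property>
```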