[hadoop]MAPREDUCE-3460 Report

1. Symptom

The MapReduce ApplicationMaster (AM) hangs.

1.1 Severity

Blocker.

1.2 Was there an exception thrown?

Yes.

2013-07-31 15:36:21,668 INFO  [main] rm.RMContainerRequestor (RMContainerRequestor.java:containerFailedOnHost(215)) - 1 failures on node h1

2013-07-31 15:36:21,668 INFO  [main] rm.RMContainerRequestor (RMContainerRequestor.java:containerFailedOnHost(220)) - Blacklisted host h1

However, no exception message is logged at the moment the hang actually occurs!

1.2.1 Were there multiple exceptions?

Yes. See above.

1.3 Affected scope:

One container request (in a rare occasion)

2. How to reproduce this failure

2.0 Version

0.23.0

2.1 Configuration

Standard

2.2 Reproduction procedure

At the beginning, do not start any nodemanager:

1). Start a simple wordcount job1;

2). Start host1’s nodemanager;

3). Wait until the container is assigned to host1, then fail container1 (e.g. using a simple script that prints “ERROR” on host1; a sketch of one way to do this follows this list).

 --- This will cause AM to blacklist host1

4). Start another simple job2;

   --- RM will allocate a container for this job2 on host1, but AM won’t accept this because host1 is blacklisted.

5). Start host2’s nodemanager;

6). We observe job1 continues and finishes on host2, but job2 hangs forever.
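
For step 3), one hypothetical way to force the container failure is a small fault-injection mapper like the sketch below. The report itself only mentions a simple script that prints “ERROR”; the class name, the repro.fail.host property, and the throw-to-fail approach here are illustrative assumptions, not the actual script used.

import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical fault-injection mapper: every map attempt that lands on the chosen
// host fails, so the AM counts container failures there and eventually blacklists it.
public class FailOnHostMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String badHost = context.getConfiguration().get("repro.fail.host", "h1");
    String localHost = InetAddress.getLocalHost().getHostName();
    if (localHost.startsWith(badHost)) {
      System.err.println("ERROR: injected failure on " + localHost);
      // Throwing here fails the task attempt, and with it the container on this host.
      throw new IOException("Injected container failure on " + localHost);
    }
    // On any other host the map is a no-op and the attempt succeeds.
  }
}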

2.2.1 Timing order

The events must occur in the order listed above.

2.2.2 Is the event order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

Yes.

2.4 How many machines needed?

2 nodes; 4 components: AM, 2 NMs, RM.

3. Diagnosis procedure

This is a long error-propagation chain, so if you do not notice the intermediate INFO-level node-failure messages, it is hard to understand the root cause.

3.1 Detailed Symptom (where you start)

The AM hangs in a wait state.

3.2 Backward inference

See the graph above. All key steps (numbered) are logged. Details below:

5). Fail container1 on host1 --- causing the AM to blacklist host1:

2013-07-31 15:36:21,668 INFO  [main] rm.RMContainerRequestor (RMContainerRequestor.java:containerFailedOnHost(215)) - 1 failures on node h1

2013-07-31 15:36:21,668 INFO  [main] rm.RMContainerRequestor (RMContainerRequestor.java:containerFailedOnHost(220)) - Blacklisted host h1

565 2013-08-01 15:35:57,778 INFO  [main] rm.RMContainerAllocator (RMContainerAllocator.java:addMap(664)) - Added attempt_1375385755744_0001_m_000000_1 to list of failed maps
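
The two messages above come from the AM's per-node failure bookkeeping. As a rough sketch of that logic (class and field names here are simplified assumptions, not the verbatim RMContainerRequestor code):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified sketch of per-node failure counting: once a node accumulates enough
// container failures, the AM blacklists it and stops requesting containers there.
class NodeBlacklistSketch {
  private final int maxFailuresPerNode;   // roughly mapreduce.job.maxtaskfailures.per.tracker
  private final Map<String, Integer> failuresPerNode = new HashMap<>();
  private final Set<String> blacklistedNodes = new HashSet<>();

  NodeBlacklistSketch(int maxFailuresPerNode) {
    this.maxFailuresPerNode = maxFailuresPerNode;
  }

  void containerFailedOnHost(String host) {
    int failures = failuresPerNode.merge(host, 1, Integer::sum);
    System.out.println(failures + " failures on node " + host);       // "1 failures on node h1"
    if (failures >= maxFailuresPerNode && blacklistedNodes.add(host)) {
      System.out.println("Blacklisted host " + host);                 // "Blacklisted host h1"
    }
  }

  boolean isNodeBlacklisted(String host) {
    return blacklistedNodes.contains(host);
  }
}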

8). New request:

 2013-08-01 15:44:23,630 DEBUG [main] rm.RMContainerAllocator (RMContainerAllocator.java:addMap(674)) - Added attempt req to host h1

10). AM heartbeat --- this time the RM replies with a container that is still on the blacklisted host (host1). The AM recognizes the host as blacklisted, releases the container, and tries to find a replacement container request so that on the next heartbeat it can ask the RM for a new container. The bug occurs in this replacement step:

AM log:

765 2013-08-01 15:44:23,665 INFO  [main] rm.RMContainerAllocator (RMContainerAllocator.java:heartbeat(202)) - After Scheduling: PendingReduces:0 ScheduledMaps:2 ScheduledReduces:0 AssignedMaps:1 AssignedReduces:0 completedMaps:0 completedReduces:0 containersAllocated:1 containersReleased:0 hostLocalAssigned:1 rackLocalAssigned:0 availableResources(headroom):memory: 22528

    --- AM received heartbeat reply from RM

769 2013-08-01 15:44:23,667 INFO  [main] rm.RMContainerAllocator (RMContainerAllocator.java:assign(703)) - Got allocated containers 1

   --- AM starts Container Allocation:

==================================================

================= Bug starts  ====================

RMContainerAllocator.java, in assign():

          blackListed = isNodeBlacklisted(allocatedHost);
          if (blackListed) {
            // we need to request for a new container
            // and release the current one
            LOG.info("Got allocated container on a blacklisted "
                + " host "+allocatedHost
                +". Releasing container " + allocated);

    --- the corresponding AM log line:

773 2013-08-01 15:44:23,669 INFO  [main] rm.RMContainerAllocator (RMContainerAllocator.java:assign(755)) - Got allocated container on a blacklisted host h1. Releasing container Container: [ContainerId: container_1375386260815_0001_01_000003, NodeId: h1:1234, NodeHttpAddress: h1:2, Resource: memory: 1024, Priority: org.apache.hadoop.yarn.api.records.impl.pb.PriorityPBImpl@24, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp:1375386260815, }, attemptId: 1, }, id: 3, }, state: C_NEW, ]

            // find the request matching this allocated container
            // and replace it with a new one
            ContainerRequest toBeReplacedReq =
                getContainerReqToReplace(allocated);
            if (toBeReplacedReq != null) {
              LOG.info("Placing a new container request for task attempt "
                  + toBeReplacedReq.attemptID);
              ContainerRequest newReq =
                  getFilteredContainerRequest(toBeReplacedReq);
              decContainerReq(toBeReplacedReq);
              if (toBeReplacedReq.attemptID.getTaskId().getTaskType() ==
                  TaskType.MAP) {
                maps.put(newReq.attemptID, newReq);
              } else {
                reduces.put(newReq.attemptID, newReq);
              }
              addContainerReq(newReq);
            }

getContainerReqToReplace() --- the "-" lines show the original (buggy) check, the "+" lines show the fix eventually applied for this issue:

    private ContainerRequest getContainerReqToReplace(Container allocated) {
      Priority priority = allocated.getPriority();
      ContainerRequest toBeReplaced = null;
+     if (PRIORITY_FAST_FAIL_MAP.equals(priority)) {
+       LOG.info("Replacing FAST_FAIL_MAP container " + allocated.getId());
+       Iterator<TaskAttemptId> iter = earlierFailedMaps.iterator();
+       while (toBeReplaced == null && iter.hasNext()) {
+         toBeReplaced = maps.get(iter.next());
+       }
+       LOG.info("Found replacement: " + toBeReplaced);
+       return toBeReplaced;
+     }
+     else if (PRIORITY_MAP.equals(priority)) {
+       LOG.info("Replacing MAP container " + allocated.getId());
-     if (PRIORITY_FAST_FAIL_MAP.equals(priority)
-         || PRIORITY_MAP.equals(priority)) {
        // allocated container was for a map
        String host = allocated.getNodeId().getHost();
        LinkedList<TaskAttemptId> list = mapsHostMapping.get(host);
        if (list != null && list.size() > 0) {
          TaskAttemptId tId = list.removeLast();
          if (maps.containsKey(tId)) {
            toBeReplaced = maps.remove(tId);
          }
        } else {
          TaskAttemptId tId = maps.keySet().iterator().next();
          toBeReplaced = maps.remove(tId);
        }
      }
      else if (PRIORITY_REDUCE.equals(priority)) {
        TaskAttemptId tId = reduces.keySet().iterator().next();
        toBeReplaced = reduces.remove(tId);
      }
      return toBeReplaced;
    }

  --- So here is the bug: when the AM notices that the container allocated by the RM is on a blacklisted host, it tries to find a replacement container request (attemptID). But the logic in getContainerReqToReplace() does not handle the blacklisted-host case correctly --- it handles it exactly as if it were a normal MAP request, looking up pending requests by host. It therefore picks a container request targeting the same host; in this case it picks attempt_1375386260815_0001_m_000000_3, the request that currently belongs to the other running job. As a consequence, that job silently loses its request and hangs forever.
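
A minimal toy illustration of that faulty selection (the data structures and names below are simplified stand-ins for the AM's maps and mapsHostMapping, not Hadoop code):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the buggy replacement choice: the allocation on blacklisted host h1 was
// made for the fast-fail retry, but the replacement request is looked up *by host*,
// exactly as for a normal MAP request, so it pulls out a different attempt's request.
public class ReplacementChoiceDemo {
  public static void main(String[] args) {
    // Pending map requests keyed by attempt, and a host -> attempts index.
    Map<String, String> maps = new LinkedHashMap<>();
    maps.put("fastFailRetryAttempt", "retry of the earlier failed map");
    maps.put("otherJobAttempt", "pending request of the other running job");
    Map<String, Deque<String>> mapsHostMapping = new HashMap<>();
    mapsHostMapping.computeIfAbsent("h1", h -> new ArrayDeque<>()).add("otherJobAttempt");

    // Buggy path: pick the replacement from the host mapping of the blacklisted host.
    String victim = mapsHostMapping.get("h1").removeLast();
    maps.remove(victim);
    System.out.println("Replaced request of " + victim);
    // "otherJobAttempt" has now lost its pending request; per the report, it is never
    // restored correctly, so the corresponding job never gets a container and hangs.
  }
}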

12). Another host's (nodemanager) heartbeat:

----- RM log: ------

897 2013-08-01 15:44:23,707 DEBUG [AsyncDispatcher event handler] fifo.FifoScheduler (FifoScheduler.java:nodeUpdate(593)) - Node heartbeat h3:1234 available resource = memory: 10240

933 2013-08-01 15:44:23,713 DEBUG [AsyncDispatcher event handler] scheduler.SchedulerApp (SchedulerApp.java:allocate(254)) - allocate: applicationAttemptId=appattempt_1375386260815_0001_000001 container=container_1375386260815_0001_01_000004 host=h3 type=RACK_LOCAL

… ...

4. Root cause

In the container-assignment process, the AM does not handle containers allocated on a blacklisted host correctly: it reuses the attemptID of another running job's pending request, effectively kicking that job out.

4.1 Category:

Incorrect exception handling.

5. Fix

5.1 How?

See the "+" lines in the patch above: handle containers allocated on a blacklisted host correctly by treating the FAST_FAIL_MAP priority separately, so the replacement request is taken from earlierFailedMaps instead of taking over a pending request that belongs to another attempt on that host.