[hadoop]MAPREDUCE-3531 Report

1. Symptom

Early termination

1.1 Severity


1.2 Was there exception thrown? (Exception column in the spreadsheet)


2011-12-01 11:56:25,202 ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler

java.lang.IllegalArgumentException: Invalid key to HMAC computation

Caused by: java.security.InvalidKeyException: Secret key expected

1.2.1 Were there multiple exceptions?


1.3 Scope of the failure

Single RM; affects all clients.

2. How to reproduce this failure

2.0 Version


2.1 Configuration

1 RM

# of Nodes?

10 (in unit test)

(The production environment was a 350-node cluster.)

2.2 Reproduction procedure

1. Simulate starting 10 attempts

2. Each attempt starts 100 threads.

3. Each thread calls createPassword 1000 times.
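
The three steps above can be sketched as a small stress harness. This is a minimal illustration, not the unit test from the Hadoop patch: the class name CreatePasswordStress, the run method, and the Runnable parameter standing in for a createPassword call are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical harness; names are illustrative and do not come from Hadoop.
class CreatePasswordStress {
    static final int ATTEMPTS = 10;
    static final int THREADS_PER_ATTEMPT = 100;
    static final int CALLS_PER_THREAD = 1000;

    /** Runs the workload and returns {successfulCalls, failedCalls}. */
    static long[] run(Runnable createPasswordCall) {
        AtomicLong ok = new AtomicLong();
        AtomicLong failed = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        List<Thread> threads = new ArrayList<>();

        for (int attempt = 0; attempt < ATTEMPTS; attempt++) {
            for (int i = 0; i < THREADS_PER_ATTEMPT; i++) {
                Thread t = new Thread(() -> {
                    try {
                        start.await(); // hold each thread until all are created
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    for (int c = 0; c < CALLS_PER_THREAD; c++) {
                        try {
                            createPasswordCall.run();
                            ok.incrementAndGet();
                        } catch (RuntimeException e) {
                            failed.incrementAndGet(); // e.g. IllegalArgumentException
                        }
                    }
                });
                threads.add(t);
                t.start();
            }
        }

        start.countDown(); // release all threads at nearly the same time
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return new long[] { ok.get(), failed.get() };
    }
}
```

Because the failure is a data race, whether a given run fails depends on thread scheduling. A run with zero failed calls does not show that the code under test is safe.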

2.2.1 Timing order

All threads access the same non-thread-safe variable almost simultaneously.

2.2.2 Events order externally controllable?

No (multithreaded data race).

2.3 Can the logs tell how to reproduce the failure?


2.4 How many machines needed?


3. Diagnosis procedure

3.1 Detailed Symptom (where you start)

When multiple threads call createPassword almost simultaneously, the shared variable can be left in an inconsistent state because it is not thread-safe.

3.2 Backward inference

When all threads access the variable concurrently, the resulting data race (or possibly a deadlock) can cause the RM to stop working.
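
The exception chain in 1.2 is consistent with createPassword receiving a missing (null) key, for instance when a racy lookup returns nothing. The report does not confirm that path, so it is an assumption. The sketch below uses a simplified stand-in for createPassword (not the Hadoop source; the class name HmacNullKeyDemo is hypothetical) to show that the JDK's Mac.init rejects a null key with InvalidKeyException. The stand-in wraps that exception using the message text from the log.

```java
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import javax.crypto.Mac;
import javax.crypto.SecretKey;

// Simplified stand-in for illustration only; not the Hadoop implementation.
class HmacNullKeyDemo {
    static byte[] createPassword(byte[] identifier, SecretKey key) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(key); // a null key is rejected with InvalidKeyException
            return mac.doFinal(identifier);
        } catch (InvalidKeyException e) {
            // Message text copied from the log excerpt in section 1.2.
            throw new IllegalArgumentException("Invalid key to HMAC computation", e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The exact text of the inner exception message depends on the JDK provider, so the sketch relies only on the exception type.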

4. Root cause

A non-thread-safe variable is used in a concurrent environment.

4.1 Category:

Incorrect error handling (concurrency bug in handling the error).

5. Fix

5.1 How?

Use ConcurrentHashMap, a thread-safe alternative to HashMap.
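
A minimal sketch of this kind of fix follows, assuming the shared variable is a map from a node identifier to its secret key. The class NodeKeyRegistry, its methods, and the "node-x-y" identifiers are hypothetical and are not taken from the Hadoop patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.SecretKey;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical key registry; names are illustrative, not from Hadoop.
class NodeKeyRegistry {
    // ConcurrentHashMap supports concurrent get/put without external locking.
    private final Map<String, SecretKey> secretKeys = new ConcurrentHashMap<>();

    void register(String nodeId, byte[] rawKey) {
        secretKeys.putIfAbsent(nodeId, new SecretKeySpec(rawKey, "HmacSHA1"));
    }

    SecretKey keyFor(String nodeId) {
        return secretKeys.get(nodeId); // null if the node was never registered
    }

    int size() {
        return secretKeys.size();
    }

    /** Registers keys from several threads at once and returns the final size. */
    static int registerConcurrently(NodeKeyRegistry registry, int threads, int nodesPerThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int n = 0; n < nodesPerThread; n++) {
                    registry.register("node-" + id + "-" + n, new byte[] { (byte) (id + 1) });
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return registry.size();
    }
}
```

With a plain HashMap in place of the ConcurrentHashMap, concurrent writers could lose entries or corrupt the table. ConcurrentHashMap keeps each individual get and put safe, but callers still have to handle a null result for a key that has not been registered yet.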