HDFS-3180 Report

1. Symptom

When the connection is unusable, webhdfs hangs indefinitely.

1.1 Severity

Major

1.2 Was there an exception thrown?

No, webhdfs simply hangs.

1.2.1 Were there multiple exceptions?

No

1.3 Scope of the failure

webhdfs

2. How to reproduce this failure

Break the connection using any method (e.g. disconnect webhdfs’s network), and the problem will be apparent.

2.0 Version

2.0.4-alpha, 2.1.0-beta

2.1 Configuration

webhdfs must be enabled in hdfs-site.xml:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>

2.2 Reproduction procedure

1. Start hdfs with webhdfs

2. Terminate webhdfs’s network connection

3. Observe that the webhdfs request hangs

2.2.1 Timing order

Trivial timing: hdfs and webhdfs must be started before the failure can be reproduced.

2.2.2 Events order externally controllable?

Yes

2.3 Can the logs tell how to reproduce the failure?

No. There are no logs, because webhdfs does not implement a timeout and therefore never reports the failure.

2.4 How many machines needed?

1

3. Diagnosis procedure

The symptom leads directly to the root cause. Someone with domain knowledge would know that webhdfs has no timeout handling. In this case, the person who fixed HDFS-3166 (the hftp timeout) adapted that fix for HDFS-3180.

3.1 Detailed Symptom (where you start)

Trivial: start with a hung hdfs request.

3.2 Backward inference (how do you infer from the symptom to the root cause)

1. Once we realize that webhdfs is hung, we check the network connection and the hdfs status.

2. This reveals the root cause: the connection is unusable, and webhdfs has no timeout to detect it.

3. The fix is then to implement a timeout feature.

4. Root cause

No connection timeout handling in webhdfs
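
The effect of missing timeout handling can be illustrated with plain java.net, independent of Hadoop. The following is a minimal sketch, assuming the client opens its HTTP connections through java.net.HttpURLConnection and never sets a timeout; the class name DefaultTimeouts and the example URL are illustrative only. With default JVM settings both values are 0, which java.net defines as "wait indefinitely".

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Illustrative class, not part of Hadoop.
class DefaultTimeouts {
    /** Returns {connectTimeout, readTimeout} in ms for a new, not yet connected connection. */
    static int[] defaults(String spec) throws IOException {
        // openConnection() only creates the connection object; no network I/O happens yet.
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(spec).toURL().openConnection();
        return new int[] { conn.getConnectTimeout(), conn.getReadTimeout() };
    }

    public static void main(String[] args) throws IOException {
        // Example URL only; nothing needs to be listening on this port.
        int[] t = defaults("http://localhost:50070/webhdfs/v1/?op=GETFILESTATUS");
        // A value of 0 means the timeout is disabled, i.e. the call may block forever.
        System.out.println("connect=" + t[0] + " ms, read=" + t[1] + " ms");
    }
}
```

A connection left at these defaults blocks for as long as the peer stays silent, which matches the hang described in the symptom.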

4.1 Category:

Semantic

4.2 Are there multiple faults?

No

5. Fix

A timeout feature was implemented in webhdfs. The detailed source code changes can be found in the patch.

5.1 How?

A new timeout feature was added so that a connection failure is detected instead of causing an indefinite hang.
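
The general shape of such a fix can be sketched as follows. This is not the code from the patch: the class name TimeoutSketch, the helper names, and the 500 ms value are assumptions chosen for illustration, and a local socket that never answers stands in for the broken network. The sketch sets both a connect and a read timeout on the connection, so that a silent peer produces a SocketTimeoutException rather than a hang.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Illustrative class, not the actual HDFS-3180 patch.
class TimeoutSketch {
    // Hypothetical value for the demo; the real patch defines its own timeout.
    static final int SOCKET_TIMEOUT_MS = 500;

    /** Opens a connection with both connect and read timeouts applied. */
    static HttpURLConnection open(URL url, int timeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(timeoutMs); // bound the time to establish the TCP connection
        conn.setReadTimeout(timeoutMs);    // bound the time spent waiting for response data
        return conn;
    }

    /** Returns true if a request to a server that never answers fails with a timeout. */
    static boolean timesOutAgainstSilentServer(int timeoutMs) throws IOException {
        // The socket listens but accept() is never called: the OS completes the
        // TCP handshake, yet no HTTP response is ever sent.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort()
                    + "/webhdfs/v1/?op=GETFILESTATUS").toURL();
            HttpURLConnection conn = open(url, timeoutMs);
            try {
                conn.getResponseCode(); // would block forever without the read timeout
                return false;
            } catch (SocketTimeoutException e) {
                return true;            // the failure is now visible to the caller
            } finally {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + timesOutAgainstSilentServer(SOCKET_TIMEOUT_MS));
    }
}
```

Routing every connection through one helper such as open() keeps the timeout policy in a single place. This mirrors the report's note that the hftp fix from HDFS-3166 was adapted for webhdfs, though the exact mechanism is described only in the patch.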