HDFS-4205 Report

1. Symptom

FSCK fails after a symlink is created.

1.1 Severity

Major.

1.2 Was an exception thrown?

Yes.

1.2.1 Were there multiple exceptions?

Yes: AccessControlException, IOException, HadoopIllegalArgumentException, PriviledgedActionException, UnresolvedPathException.

1.3 Scope of the failure

The cause of this bug was simple, but its consequences were severe: it crashed the FSCK tool and prevented the symlink from being deleted by normal means. The only way to remove the symlink and recover was to delete the whole parent folder.

2. How to reproduce this failure

2.0 Version

2.0.2-alpha

2.1 Configuration

One namenode and one datanode with one file.

2.2 Reproduction procedure

No explicit command for creating a symlink is available yet, so the failure cannot be reproduced from the command line. Instead, a test case was written for the JIRA report page; it triggers the bug by performing the following actions.

  1. Create a cluster (the test case starts 4 datanodes, although one is enough);
  2. Create a simple file in it;
  3. Create a symlink;
  4. Try to run the FSCK tool, which should fail.

If a command exposing this functionality were available, the failure could be reproduced without a test case, simply by creating a symlink and then running fsck.

2.2.1 Timing order

Not sensitive.

2.2.2 Events order externally controllable?

Yes.

2.3 Can the logs tell how to reproduce the failure?

Yes. Many exceptions are thrown and the symptoms are very clear.

2.4 How many machines needed?

1 machine with at least 1 namenode and 1 datanode.

3. Diagnosis procedure

The error propagation was short and simple to understand, but it could span a long time: the symlink could be created long before anyone used it or ran the fsck tool.

3.1 Detailed Symptom (where you start)

FSCK crashes after a symlink is created.

3.2 Backward inference

The buggy method created an invalid symbolic link on the file system, putting it into an invalid state. This action did not trigger any notification. The first symptom only appeared when a command checked the status of the symlink and received an UnresolvedPathException. The exception was caught and handled properly, and execution continued. The second symptom appeared when the FSCK tool was run.

In a real-world scenario, the user would probably not check the state of the symlink after creating it. The user would create the symlink and use it right away, triggering the bug. The exception, though, is clear: FSCK fails because there is an unresolved path, and that path is named in the exception. So the diagnosis is straightforward.
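The propagation described above can be illustrated with a small, self-contained Java model. This is only a sketch under stated assumptions, not HDFS code: the class SymlinkFsckModel, its methods, and UnresolvedPathModelException are hypothetical names invented for illustration, and the model simply assumes that any symlink entry cannot be resolved. It shows the three stages from the report: the link is created silently, a status check catches the exception and continues, and a checker that walks every entry without handling the exception aborts.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the failure propagation; none of these names exist in Hadoop.
class SymlinkFsckModel {

    /** Hypothetical stand-in for the UnresolvedPathException named in the report. */
    static class UnresolvedPathModelException extends RuntimeException {
        UnresolvedPathModelException(String path) {
            super("unresolved path: " + path);
        }
    }

    enum Kind { FILE, SYMLINK }

    // Toy namespace: path -> kind of entry, in insertion order.
    private final Map<String, Kind> entries = new LinkedHashMap<>();

    void createFile(String path) {
        entries.put(path, Kind.FILE);
    }

    // Creating the link succeeds silently: nothing signals the invalid state.
    void createSymlink(String link) {
        entries.put(link, Kind.SYMLINK);
    }

    // Assumption of this model: a symlink entry can never be resolved.
    String resolve(String path) {
        Kind kind = entries.get(path);
        if (kind == null) {
            throw new IllegalArgumentException("no such path: " + path);
        }
        if (kind == Kind.SYMLINK) {
            throw new UnresolvedPathModelException(path);
        }
        return path;
    }

    // First symptom: the status check catches the exception and continues.
    boolean statusOk(String path) {
        try {
            resolve(path);
            return true;
        } catch (UnresolvedPathModelException e) {
            return false;
        }
    }

    // Second symptom: the checker resolves every entry and does not handle
    // the exception, so a single symlink aborts the whole run.
    List<String> fsck() {
        List<String> healthy = new ArrayList<>();
        for (String path : entries.keySet()) {
            healthy.add(resolve(path));
        }
        return healthy;
    }

    public static void main(String[] args) {
        SymlinkFsckModel fs = new SymlinkFsckModel();
        fs.createFile("/dir/file");
        System.out.println("fsck before symlink: " + fs.fsck());

        fs.createSymlink("/dir/link");
        System.out.println("status check ok? " + fs.statusOk("/dir/link"));

        try {
            fs.fsck();
        } catch (UnresolvedPathModelException e) {
            System.out.println("fsck aborted: " + e.getMessage());
        }
    }
}
```

In this model the message of the uncaught exception carries the offending path, which mirrors why the diagnosis in the report is straightforward once FSCK is run.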

4. Root cause

Symlink support is completely broken.

4.1 Category

Semantic Error

5. Fix

5.1 How?

Rebuild the symlink support. This primitive is simple and should not be hard to implement, but it is not available yet.