Open Source Software Study Raw Data.xlsx
1
Project | Key | Summary | Description | Bug Type | Issue Type | Status | Priority | Resolution | Assignee | Reporter | Creator | Created | Last Viewed | Updated | Resolved | Fixing Time
2
Hadoop CommonHADOOP-9964O.A.H.U.ReflectionUtils.printThreadInfo() is not thread-safe, which causes TestHttpServer to hang for 10 minutes or longer.The printThreadInfo() in ReflectionUtils is not thread-safe, which causes two or more threads calling this method from StackServlet to deadlock.DeadlockBugClosedMajorFixedJunping DuJunping DuJunping Du9/16/13 7:056/29/15 13:052/24/14 20:589/30/13 18:0314.5
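The report above describes a deadlock caused by concurrent, unsynchronized calls into a shared thread-dump routine. A minimal sketch of the kind of fix it implies, serializing the printing behind a single lock; the class, lock, and method below are illustrative stand-ins, not the actual Hadoop code:
{code}
import java.io.PrintStream;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadInfoPrinter {
  // One lock so concurrent servlet threads cannot interleave their dumps (illustrative).
  private static final Object PRINT_LOCK = new Object();

  public static void printThreadInfo(PrintStream out, String title) {
    synchronized (PRINT_LOCK) {
      ThreadMXBean bean = ManagementFactory.getThreadMXBean();
      out.println("Thread dump: " + title);
      for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
        out.println(info);
      }
    }
  }

  public static void main(String[] args) {
    printThreadInfo(System.out, "demo");
  }
}
{code}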
3
Hadoop CommonHADOOP-9932Improper synchronization in RetryCacheIn LightWeightCache#evictExpiredEntries(), the precondition check can fail. [~patwhitey2007] ran an HA failover test and it occurred while the SBN was catching up with edits during a transition to active. This caused the NN to terminate.

Here is my theory: if an RPC handler calls waitForCompletion() and it happens to remove the head of the queue in get(), it will race with evictExpiredEntries() from put().
Data raceBugClosedBlockerFixedKihwal LeeKihwal LeeKihwal Lee9/4/13 17:409/4/14 0:499/4/13 21:010.1
4
Hadoop CommonHADOOP-9916Race condition in ipc.Client causes TestIPC timeoutTestIPC times out occasionally, for example:
[https://issues.apache.org/jira/browse/HDFS-5130?focusedCommentId=13749870&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13749870]
[https://issues.apache.org/jira/browse/HADOOP-9915?focusedCommentId=13753302&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13753302]

Looking into the code, there is a race condition in o.a.h.ipc.Client between the RPC call thread and the connection read-response thread:

{code}
if (status == RpcStatusProto.SUCCESS) {
  Writable value = ReflectionUtils.newInstance(valueClass, conf);
  value.readFields(in); // read value
  call.setRpcResponse(value);
  calls.remove(callId);
}
{code}

Read Thread:
Connection.receiveRpcResponse-> call.setRpcResponse(value) -> notify Call Thread

Call Thread:
Client.call -> Connection.addCall(retry with the same callId) -> notify read thread

Read Thread:
calls.remove(callId) # intends to remove the old call, but removes the newly added call...
Connection.waitForWork ends up waiting maxIdleTime and closing the connection. The call never gets a response and hangs.

The problem didn't show before because callId used to be unique, so we never accidentally removed newly added calls; once retry was added, this race condition became possible.

To solve this, we can simply change the order: remove the call first, then notify the call thread (see the sketch after this record).
Note there are many places that need this order change (normal case, error case, cleanup case).

And there are some minor issues in TestIPC:
1. There are two methods with the same name:
void testSerial()
void testSerial(int handlerCount, boolean handlerSleep, ...)
The second is not a test case (so it should not use the testXXX prefix), but somehow it causes testSerial (the first one) to run twice; see the test report:
{code}
<testcase time="26.896" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
<testcase time="25.426" classname="org.apache.hadoop.ipc.TestIPC" name="testSerial"/>
{code}

2. A timeout annotation should be added, so that next time the related log is available.

Order violationBugClosedMinorFixedBinglin ChangBinglin ChangBinglin Chang8/29/13 8:279/25/13 0:339/4/13 11:366.1
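A minimal sketch of the ordering fix proposed in HADOOP-9916 above: remove the call from the pending-calls map first, then complete it and wake the caller. The Call class and map here are simplified stand-ins, not the actual ipc.Client code:
{code}
import java.util.concurrent.ConcurrentHashMap;

public class OrderedResponseSketch {
  static class Call {
    final int id;
    Object response;
    Call(int id) { this.id = id; }
    synchronized void setResponse(Object value) {
      this.response = value;
      notifyAll();                       // wakes the waiting caller thread
    }
  }

  private final ConcurrentHashMap<Integer, Call> calls = new ConcurrentHashMap<>();

  void receiveResponse(int callId, Object value) {
    // Remove FIRST, then complete. If we completed first, the caller could retry
    // and re-register the same callId before remove(), and remove() would then
    // discard the newly added call -- the race described above.
    Call call = calls.remove(callId);
    if (call != null) {
      call.setResponse(value);
    }
  }

  public static void main(String[] args) {
    OrderedResponseSketch client = new OrderedResponseSketch();
    Call call = new Call(1);
    client.calls.put(1, call);
    client.receiveResponse(1, "ok");
    System.out.println("response: " + call.response);
  }
}
{code}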
5
Hadoop CommonHADOOP-9894Race condition in Shell leads to logged error stream handling exceptionsShell.runCommand starts an error stream handling thread and normally joins with it before closing the error stream. However, if parseExecResult throws an exception (e.g., as Stat.parseExecResult does for FileNotFoundException), then the error thread is not joined and the error stream can be closed before the error stream handling thread is finished. This causes the error stream handling thread to log an exception backtrace for a "normal" situation.Order violationBugResolvedMajorFixedArpit AgarwalJason LoweJason Lowe8/21/13 19:228/29/13 15:428/29/13 1:177.2
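A minimal sketch of the joining discipline HADOOP-9894 above asks for: always join the stderr-draining thread before the streams are torn down, even when reading or parsing the command output throws. The runCommand below is a simplified stand-in, not the actual Shell code:
{code}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class RunCommandSketch {
  public static String runCommand(String... cmd) throws IOException, InterruptedException {
    Process p = new ProcessBuilder(cmd).start();
    StringBuilder err = new StringBuilder();
    Thread errThread = new Thread(() -> {
      try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getErrorStream()))) {
        String line;
        while ((line = r.readLine()) != null) {
          err.append(line).append('\n');
        }
      } catch (IOException ignored) {
        // stderr closed underneath us; nothing more to drain
      }
    });
    errThread.start();
    StringBuilder out = new StringBuilder();
    try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
      String line;
      while ((line = r.readLine()) != null) {
        out.append(line).append('\n');
      }
    } finally {
      // Join the stderr thread even if reading/parsing stdout threw, so the
      // error stream is never closed while that thread is still using it.
      errThread.join();
    }
    if (p.waitFor() != 0) {
      throw new IOException("command failed: " + err);
    }
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    System.out.print(runCommand("echo", "hello"));
  }
}
{code}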
6
Hadoop CommonHADOOP-9787ShutdownHelper util to shutdown threads and threadpoolsSeveral classes spawn threads and threadpools and shut them down in close(), serviceStop(), etc. A helper class would standardize the wait time, after a thread/threadpool is interrupted or shut down, for it to actually terminate.

One example of this is MAPREDUCE-5428.
StarvationBugClosedMajorFixedKarthik KambatlaKarthik KambatlaKarthik Kambatla7/29/13 21:1311/3/14 18:348/1/13 1:072.2
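A minimal sketch of the kind of helper HADOOP-9787 above proposes, standardizing a bounded wait after an ExecutorService is shut down; the class and method names are illustrative, not the actual Hadoop API:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class ShutdownHelperSketch {
  private ShutdownHelperSketch() {}

  /** Shut down an executor and wait a bounded time for it to actually terminate. */
  public static boolean shutdownAndWait(ExecutorService pool, long timeoutMs) {
    pool.shutdown();                                   // stop accepting new tasks
    try {
      if (!pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS)) {
        pool.shutdownNow();                            // interrupt running tasks
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
      }
      return true;
    } catch (InterruptedException e) {
      pool.shutdownNow();
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    pool.submit(() -> System.out.println("work"));
    System.out.println("terminated cleanly: " + shutdownAndWait(pool, 1000));
  }
}
{code}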
7
Hadoop CommonHADOOP-969deadlock in job tracker RetireJobsThe JobTracker deadlocks because RetireJobs grabs locks in the wrong order. The call stacks look like:

"IPC Server handler 5 on 50020":
at org.apache.hadoop.mapred.JobTracker.getNewTaskForTaskTracker(JobTracker.java:1108)
- waiting to lock <0x74487a80> (a java.util.Vector)
- locked <0x744874b0> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:992)
- locked <0x744874b0> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)
"retireJobs":
at org.apache.hadoop.mapred.JobTracker.removeJobTasks(JobTracker.java:782)
- waiting to lock <0x744874b0> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.JobTracker.access$300(JobTracker.java:42)
at org.apache.hadoop.mapred.JobTracker$RetireJobs.run(JobTracker.java:312)
- locked <0x74487bb0> (a java.util.ArrayList)
- locked <0x74487a80> (a java.util.Vector)
- locked <0x74487a58> (a java.util.TreeMap)
at java.lang.Thread.run(Thread.java:595)

Found 1 deadlock.

DeadlockBugClosedCriticalFixedOwen O'MalleyOwen O'MalleyOwen O'Malley2/2/07 17:587/8/09 17:522/2/07 20:080.1
8
Hadoop CommonHADOOP-9678TestRPC#testStopsAllThreads intermittently fails on WindowsException:
{noformat}
junit.framework.AssertionFailedError: null
at org.apache.hadoop.ipc.TestRPC.testStopsAllThreads(TestRPC.java:440)
{noformat}
Data raceBugClosedMajorFixedIvan MiticIvan MiticIvan Mitic6/29/13 23:529/3/14 23:507/2/13 6:512.3
9
Hadoop CommonHADOOP-9549WebHdfsFileSystem hangs on close()When close() is called via the fs shutdown hook, the synchronized method removeRenewAction() hangs. This is because DelegationTokenRenewer calls DelayQueue.take() inside a synchronized block. Since this is a blocking call, it hangs forever.SuspensionBugClosedBlockerFixedDaryn SharpKihwal LeeKihwal Lee5/1/13 22:289/3/14 23:475/10/13 17:268.8
10
Hadoop CommonHADOOP-9504MetricsDynamicMBeanBase has concurrency issues in createMBeanInfoPlease see HBASE-8416 for detailed information.
We need to take care of synchronization for HashMap put(); otherwise it may lead to a spin loop.
Data raceBugClosedCriticalFixedLiang XieLiang XieLiang Xie4/25/13 11:378/4/13 9:237/7/13 0:0872.5
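The spin loop mentioned in HADOOP-9504 above is the classic symptom of unsynchronized concurrent HashMap.put() calls corrupting the table. A minimal sketch of one way to avoid it with a concurrent map; the class below is illustrative, not the actual MetricsDynamicMBeanBase fix:
{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MBeanInfoCacheSketch {
  // A plain HashMap written from several threads can end up with a corrupted
  // bucket chain and readers spinning forever; ConcurrentHashMap avoids that.
  private final Map<String, Integer> attributeIndex = new ConcurrentHashMap<>();

  void register(String attribute, int index) {
    attributeIndex.put(attribute, index);
  }

  Integer lookup(String attribute) {
    return attributeIndex.get(attribute);
  }

  public static void main(String[] args) throws InterruptedException {
    MBeanInfoCacheSketch cache = new MBeanInfoCacheSketch();
    Thread a = new Thread(() -> { for (int i = 0; i < 10000; i++) cache.register("a" + i, i); });
    Thread b = new Thread(() -> { for (int i = 0; i < 10000; i++) cache.register("b" + i, i); });
    a.start(); b.start(); a.join(); b.join();
    System.out.println(cache.lookup("a42"));
  }
}
{code}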
11
Hadoop CommonHADOOP-9478Fix race conditions during the initialization of Configuration related to deprecatedKeyMapWhen we launch the client application which uses Kerberos security, the FileSystem can't be created because of the exception 'java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil'.

I checked the exception stack trace; it may be caused by the unsafe get operation on the deprecatedKeyMap used by org.apache.hadoop.conf.Configuration.

So I wrote a simple test case:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class HTest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.addResource("core-site.xml");
    conf.addResource("hdfs-site.xml");
    FileSystem fileSystem = FileSystem.get(conf);
    System.out.println(fileSystem);
    System.exit(0);
  }
}
{code}

Then I launched this test case many times; the following exception was thrown:
{noformat}
Exception in thread "TGT Renewer for XXX" java.lang.ExceptionInInitializerError
at org.apache.hadoop.security.UserGroupInformation.getTGT(UserGroupInformation.java:719)
at org.apache.hadoop.security.UserGroupInformation.access$1100(UserGroupInformation.java:77)
at org.apache.hadoop.security.UserGroupInformation$1.run(UserGroupInformation.java:746)
at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 16
at java.util.HashMap.getEntry(HashMap.java:345)
at java.util.HashMap.containsKey(HashMap.java:335)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1989)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1867)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:1785)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:712)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:731)
at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1047)
at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:76)
... 4 more
Exception in thread "main" java.io.IOException: Couldn't create proxy provider class org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:453)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:133)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:436)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:403)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:125)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2262)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:86)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2296)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
at HTest.main(HTest.java:11)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.hadoop.hdfs.NameNodeProxies.createFailoverProxyProvider(NameNodeProxies.java:442)
... 11 more
Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.hadoop.security.SecurityUtil
at org.apache.hadoop.net.NetUtils.createSocketAddrForHost(NetUtils.java:231)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:211)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:159)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:148)
at org.apache.hadoop.hdfs.DFSUtil.getAddressesForNameserviceId(DFSUtil.java:452)
at org.apache.hadoop.hdfs.DFSUtil.getAddresses(DFSUtil.java:434)
at org.apache.hadoop.hdfs.DFSUtil.getHaNnRpcAddresses(DFSUtil.java:496)
at org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.<init>(ConfiguredFailoverProxyProvider.java:88)
... 16 more
{noformat}


If the HashMap is used in a multi-threaded environment, not only should the put operation be synchronized; the get operations (e.g. containsKey) should be synchronized too.

The simple solution is to trigger the init of SecurityUtil before creating the FileSystem, but I think the get on deprecatedKeyMap should also be synchronized.

Thanks.
Data raceBugClosedMajorFixedColin Patrick McCabeDongyong WangDongyong Wang4/16/13 13:082/25/14 1:212/25/14 1:21314.5
12
Hadoop CommonHADOOP-9459ActiveStandbyElector can join election even before Service HEALTHY, and results in null data at ActiveBreadCrumbActiveStandbyElector can store null at ActiveBreadCrumb in the race condition below. After that, all failovers will fail, resulting in an NPE.

1. ZKFC restarted.
2. Due to the machine being busy, the first zk connection expires even before the health monitoring returns the status.
3. On re-establishment, transitionToActive will be called; at this time appData will be null.
4. So now ActiveBreadCrumb will have null.
5. After this, any failover will fail, throwing

{noformat}java.lang.NullPointerException
at org.apache.hadoop.util.StringUtils.byteToHexString(StringUtils.java:171)
at org.apache.hadoop.ha.ActiveStandbyElector.fenceOldActive(ActiveStandbyElector.java:892)
at org.apache.hadoop.ha.ActiveStandbyElector.becomeActive(ActiveStandbyElector.java:797)
at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:475)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:545)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497){noformat}


The elector should not join the election before the service is HEALTHY.
Order violationBugClosedCriticalFixedVinayakumar BVinayakumar BVinayakumar B2/1/13 12:349/3/14 23:475/14/13 7:44101.8
13
Hadoop CommonHADOOP-9393TestRPC fails with JDK7Since the test order is different with JDK7, we hit an error in {{TestRPC#testStopsAllThreads}} on the initial assertEquals. This is because {{testRPCInterruptedSimple}} and {{testRPCInterrupted}} don't stop the server after they finish, leaving some threads around.SuspensionBugResolvedMajorFixedAndrew WangAndrew WangAndrew Wang3/11/13 19:283/12/13 14:043/12/13 6:260.5
14
Hadoop CommonHADOOP-9252StringUtils.humanReadableInt(..) has a race conditionhumanReadableInt(..) incorrectly uses oneDecimal without synchronization.

Also, limitDecimalTo2(double) correctly uses decimalFormat with synchronization. However, synchronization can be avoided for better performance.
Data raceBugClosedMinorFixedTsz Wo Nicholas SzeTsz Wo Nicholas SzeTsz Wo Nicholas Sze1/26/13 1:353/14/13 16:032/4/13 21:439.8
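DecimalFormat is not thread-safe, which is the race HADOOP-9252 above describes; the report also notes synchronization can be avoided. A minimal sketch of one synchronization-free option using a per-thread format; this is illustrative only, not claimed to be the committed fix:
{code}
import java.text.DecimalFormat;

public class HumanReadableSketch {
  // DecimalFormat is not thread-safe, so give each thread its own instance
  // instead of sharing one (or synchronizing on a shared one).
  private static final ThreadLocal<DecimalFormat> ONE_DECIMAL =
      ThreadLocal.withInitial(() -> new DecimalFormat("0.0"));

  public static String humanReadableInt(long number) {
    double abs = Math.abs((double) number);
    if (abs >= 1L << 40) return ONE_DECIMAL.get().format(number / (double) (1L << 40)) + "t";
    if (abs >= 1L << 30) return ONE_DECIMAL.get().format(number / (double) (1L << 30)) + "g";
    if (abs >= 1L << 20) return ONE_DECIMAL.get().format(number / (double) (1L << 20)) + "m";
    if (abs >= 1L << 10) return ONE_DECIMAL.get().format(number / (double) (1L << 10)) + "k";
    return String.valueOf(number);
  }

  public static void main(String[] args) {
    System.out.println(humanReadableInt(123456789L)); // roughly 117.7m
  }
}
{code}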
15
Hadoop CommonHADOOP-9220Unnecessary transition to standby in ActiveStandbyElectorWhen performing a manual failover from one HA node to a second, under some circumstances the second node will transition from standby -> active -> standby -> active. This is with automatic failover enabled, so there is a ZK cluster doing leader election.
SuspensionBugClosedCriticalFixedTom WhiteTom WhiteTom White1/17/13 16:059/3/14 23:475/14/13 16:37117.0
16
Hadoop CommonHADOOP-9212Potential deadlock in FileSystem.Cache/IPC/UGIjcarder found a cycle which could lead to a potential deadlock.DeadlockBugClosedMajorFixedTom WhiteTom WhiteTom White1/15/13 10:372/15/13 13:121/16/13 10:511.0
17
Hadoop CommonHADOOP-9183Potential deadlock in ActiveStandbyElectorA jcarder run found a potential deadlock in the locking of ActiveStandbyElector and ActiveStandbyElector.WatcherWithClientRef. No deadlock has been seen in practice, this is just a theoretical possibility at the moment.DeadlockBugClosedMajorFixedTom WhiteTom WhiteTom White1/7/13 16:532/15/13 13:111/10/13 10:092.7
18
Hadoop CommonHADOOP-9152HDFS can report negative DFS Used on clusters with very small amounts of dataI had a near empty HDFS instance where I was creating a file and deleting it very quickly. I noticed that HDFS sometimes reported a negative DFS used.

{noformat}
root@brock0-1 ~]# sudo -u hdfs -i hdfs dfsadmin -report
Configured Capacity: 97233235968 (90.56 GB)
Present Capacity: 84289609707 (78.5 GB)
DFS Remaining: 84426645504 (78.63 GB)
DFS Used: -137035797 (-133824.02 KB)
DFS Used%: -0.16%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: brock0-1.ent.cloudera.com
Decommission Status : Normal
Configured Capacity: 97233235968 (90.56 GB)
DFS Used: -137035797 (-133824.02 KB)
Non DFS Used: 12943626261 (12.05 GB)
DFS Remaining: 84426645504 (78.63 GB)
DFS Used%: -0.14%
DFS Remaining%: 86.83%
Last contact: Thu Nov 22 18:25:37 PST 2012




[root@brock0-1 ~]# sudo -u hdfs -i hdfs dfsadmin -report
Configured Capacity: 97233235968 (90.56 GB)
Present Capacity: 84426973184 (78.63 GB)
DFS Remaining: 84426629120 (78.63 GB)
DFS Used: 344064 (336 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 1 (1 total, 0 dead)

Live datanodes:
Name: 127.0.0.1:50010 (localhost)
Hostname: brock0-1.ent.cloudera.com
Decommission Status : Normal
Configured Capacity: 97233235968 (90.56 GB)
DFS Used: 344064 (336 KB)
Non DFS Used: 12806262784 (11.93 GB)
DFS Remaining: 84426629120 (78.63 GB)
DFS Used%: 0%
DFS Remaining%: 86.83%
Last contact: Thu Nov 22 18:28:47 PST 2012
{noformat}
Order violationBugClosedMinorFixedBrock NolandBrock NolandBrock Noland11/26/12 22:462/15/13 13:1112/18/12 19:5121.9
19
Hadoop CommonHADOOP-9137Support connection limiting in IPC serverData raceBugClosedMajorFixedKihwal LeeSanjay RadiaSanjay Radia11/30/12 18:104/24/15 23:481/30/15 23:24791.2
20
Hadoop CommonHADOOP-9126FormatZK and ZKFC startup can fail due to zkclient connection establishment delayFormat and ZKFC startup flows continue after creating the zkclient connection without waiting to check whether the connection is completely established. This leads to a failure at a subsequent point if the connection was not complete by then.

Exception trace for format
{noformat}
12/05/30 19:48:24 INFO zookeeper.ClientCnxn: Socket connection established to HOST-xx-xx-xx-55/xx.xx.xx.55:2182, initiating session
12/05/30 19:48:24 INFO zookeeper.ClientCnxn: Session establishment complete on server HOST-xx-xx-xx-55/xx.xx.xx.55:2182, sessionid = 0x1379da4660c0014, negotiated timeout = 5000
12/05/30 19:48:24 WARN ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x1379da4660c0014
12/05/30 19:48:24 INFO zookeeper.ZooKeeper: Session: 0x1379da4660c0014 closed
12/05/30 19:48:24 INFO zookeeper.ClientCnxn: EventThread shut down
Exception in thread "main" java.io.IOException: Couldn't determine existence of znode '/hadoop-ha/hacluster'
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:263)
at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:257)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:195)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:58)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:163)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:159)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:438)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:159)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:171)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hadoop-ha/hacluster
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049)
at org.apache.hadoop.ha.ActiveStandbyElector.parentZNodeExists(ActiveStandbyElector.java:261)
... 8 more

{noformat}
Data raceBugResolvedMajorFixedRakesh Rsuja ssuja s5/30/12 15:155/2/13 3:2912/10/12 22:13194.3
21
Hadoop CommonHADOOP-9115Deadlock in configuration when writing configuration to hdfsThis was noticed when using hive with hadoop-1.1.1 and running

{code}
select count(*) from tbl;
{code}

This would cause a deadlock in Configuration.
DeadlockBugClosedBlockerFixedJing ZhaoArpit GuptaArpit Gupta12/4/12 18:313/6/13 9:5512/5/12 0:130.2
22
Hadoop CommonHADOOP-911Multithreading issue with libhdfs libraryMultithreaded applications using libhdfs sometimes run into IllegalMonitorStateException or plainly lock up (strace shows a thread being in a loop of calling sched_yield). It probably has to do with the fact that libhdfs does not ensure proper allocation of hashtable entries that map a threadId to JNI interface pointer.Not ClearBugClosedBlockerFixedChristian KunzChristian KunzChristian Kunz1/19/07 22:476/29/15 14:567/8/09 18:057/11/07 21:39173.0
23
Hadoop CommonHADOOP-9036TestSinkQueue.testConcurrentConsumers fails intermittently (Backports HADOOP-7292)org.apache.hadoop.metrics2.impl.TestSinkQueue.testConcurrentConsumers


Error Message

should've thrown
Stacktrace

junit.framework.AssertionFailedError: should've thrown
at org.apache.hadoop.metrics2.impl.TestSinkQueue.shouldThrowCME(TestSinkQueue.java:229)
at org.apache.hadoop.metrics2.impl.TestSinkQueue.testConcurrentConsumers(TestSinkQueue.java:195)
Standard Output

2012-10-03 16:51:31,694 INFO impl.TestSinkQueue (TestSinkQueue.java:consume(243)) - sleeping
Data raceBugClosedMajorFixedSuresh SrinivasIvan MiticIvan Mitic11/14/12 1:345/15/13 6:1611/18/12 21:364.8
24
Hadoop CommonHADOOP-8982TestSocketIOWithTimeout fails on WindowsThis is a possible race condition or difference in socket handling on Windows.Data raceBugClosedMajorFixedChris NaurothChris NaurothChris Nauroth10/25/12 0:069/3/14 23:476/6/13 5:18224.2
25
Hadoop CommonHADOOP-886thousands of TimerThreads created by metrics APIWhen running the smallJobsBenchmark with 180 maps and hadoop metrics logging to a file
(i.e. the hadoop-metrics.properties file contains
dfs.class=org.apache.hadoop.metrics.file.FileContext
mapred.class=org.apache.hadoop.metrics.file.FileContext)
then I get this error:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:574)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:517)
at org.apache.hadoop.ipc.Client.call(Client.java:452)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:164)
at org.apache.hadoop.dfs.$Proxy0.isDir(Unknown Source)
at org.apache.hadoop.dfs.DFSClient.isDirectory(DFSClient.java:325)
at org.apache.hadoop.dfs.DistributedFileSystem.isDirectory(DistributedFileSystem.java:167)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:82)
at org.apache.hadoop.dfs.DistributedFileSystem.copyToLocalFile(DistributedFileSystem.java:222)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:842)
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:86)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1338)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:337)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:538)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)

Using jconsole, I see that 2000+ of these threads were created:

Name: Timer-101
State: TIMED_WAITING on java.util.TaskQueue@1501026
Total blocked: 0 Total waited: 5
Stack trace:
java.lang.Object.wait(Native Method)
java.util.TimerThread.mainLoop(Timer.java:509)
java.util.TimerThread.run(Timer.java:462)

The only use of the java.util.Timer API is in org.apache.hadoop.metrics.spi.AbstractMetricsContext.
StarvationBugClosedMajorFixedNigel DaleyNigel DaleyNigel Daley1/12/07 3:472/3/07 3:231/17/07 0:124.9
26
Hadoop CommonHADOOP-8732Address intermittent test failures on WindowsThere are a few tests that fail intermittently on Windows with a timeout error. This means that the test was actually killed from the outside, and it would continue to run otherwise.

The following are examples of such tests (there might be others):
- TestJobInProgress (this issue reproes pretty consistently in Eclipse on this one)
- TestControlledMapReduceJob
- TestServiceLevelAuthorization
Data raceBugResolvedMajorFixedIvan MiticIvan MiticIvan Mitic8/27/12 18:388/30/12 3:328/30/12 3:322.4
27
Hadoop CommonHADOOP-8725MR is broken when security is offHADOOP-8225 broke MR when security is off. MR was changed to stop re-reading the credentials that UGI had already read, and to stop putting those tokens back into the UGI where they already were. UGI only reads a credentials file when security is enabled, but MR uses tokens (ie. job token) even when security is disabled...Data raceBugClosedBlockerFixedDaryn SharpDaryn SharpDaryn Sharp8/24/12 2:4610/11/12 18:458/24/12 15:200.5
28
Hadoop CommonHADOOP-8684Deadlock between WritableComparator and WritableComparableClasses implementing WritableComparable in Hadoop call the method WritableComparator.define() in their static initializers. This means the classes call define() while their class is loading, holding the lock on their class objects. And the method WritableComparator.define() locks the WritableComparator class object.

On the other hand, WritableComparator.get() also locks the WritableComparator class object, and the method may create instances of the targeted comparable class, which may involve loading that class. This means the method might try to lock the targeted comparable class object while holding the lock on the WritableComparator class object.

The two paths lock the objects in reversed order, so you can fall into deadlock (a minimal illustration of the pattern follows this record).
Order violationBugClosedMinorFixedJing ZhaoHiroshi IkedaHiroshi Ikeda8/10/12 2:179/4/14 0:218/31/12 18:0221.7
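HADOOP-8684 above is a lock-ordering cycle: two code paths take the same two monitors in opposite order. A tiny self-contained illustration of the fix principle, picking one global acquisition order on every path; the two lock objects stand in for the two class objects involved and are not the actual Hadoop classes:
{code}
public class LockOrderSketch {
  private static final Object LOCK_A = new Object();  // stands in for the comparable's class lock
  private static final Object LOCK_B = new Object();  // stands in for the WritableComparator class lock

  // Deadlock-prone version: one thread takes A then B, another takes B then A.
  // The fix principle is to choose one global order (here: always A before B).
  static void path1() {
    synchronized (LOCK_A) {
      synchronized (LOCK_B) {
        // work that needs both locks
      }
    }
  }

  static void path2() {
    synchronized (LOCK_A) {   // same order as path1, so no cycle is possible
      synchronized (LOCK_B) {
        // other work that needs both locks
      }
    }
  }

  public static void main(String[] args) {
    path1();
    path2();
  }
}
{code}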
29
Hadoop CommonHADOOP-8564Port and extend Hadoop native libraries for Windows to address datanode concurrent reading and writing issueHDFS files are made up of blocks. First, let’s look at writing. When the data is written to datanode, an active or temporary file is created to receive packets. After the last packet for the block is received, we will finalize the block. One step during finalization is to rename the block file to a new directory. The relevant code can be found via the call sequence: FSDataSet.finalizeBlockInternal -> FSDir.addBlock.
{code}
if ( ! metaData.renameTo( newmeta ) ||
     ! src.renameTo( dest ) ) {
  throw new IOException( "could not move files for " + b +
                         " from tmp to " +
                         dest.getAbsolutePath() );
}
{code}

Let’s then switch to reading. On HDFS, it is expected the client can also read these unfinished blocks. So when the read calls from client reach datanode, the datanode will open an input stream on the unfinished block file.

The problem comes in when the file is opened for reading while the datanode receives the last packet from the client and tries to rename the finished block file. This operation will succeed on Linux, but not on Windows. The behavior can be modified on Windows to open the file with the FILE_SHARE_DELETE flag on, i.e. sharing the delete (including renaming) permission with other processes while opening the file. There is also a Java bug ([id 6357433|http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6357433]) reported a while back on this. However, since this behavior has existed for Java on Windows since JDK 1.0, the Java developers do not want to break backward compatibility on this behavior. Instead, a new file system API is proposed in JDK 7.


As outlined in the [Java forum|http://www.java.net/node/645421] by the Java developer (kbr), there are three ways to fix the problem:
# Use different mechanism in the application in dealing with files.
# Create a new implementation of InputStream abstract class using Windows native code.
# Patch JDK with a private patch that alters FileInputStream behavior.

For the third option, it cannot fix the problem for users using Oracle JDK.

We discussed some options for the first approach. For example, one option is to use two-phase renaming, i.e. first hardlink, then remove the old hardlink when the read is finished. This option was thought to be rather invasive. Another option discussed is to change the HDFS behavior on Windows by not allowing clients to read unfinished blocks. However, this behavior change is thought to be problematic and may affect other applications built on top of HDFS.

For all the reasons discussed above, we will use the second approach to address the problem.

If there are better options to fix the problem, we would also like to hear about them.
Data raceBugResolvedMajorFixedChuan LiuChuan LiuChuan Liu7/5/12 19:485/2/13 3:2910/23/12 0:13109.2
30
Hadoop CommonHADOOP-846Progress report is not sent during the intermediate sorts in the map phaseHave seen tasks getting lost at the TaskTracker's end due to MapTask's progress not getting reported for a long time (the configured timeout). The progress report is currently not sent in the intermediate sort phases in the MapTask. But, if for some reason, the sort takes a long time, the TaskTracker might kill the task.Data raceBugClosedMajorFixedDevaraj DasDevaraj DasDevaraj Das12/22/06 16:507/8/09 17:521/4/07 18:5113.1
31
Hadoop CommonHADOOP-8406CompressionCodecFactory.CODEC_PROVIDERS iteration is thread-unsafeCompressionCodecFactory defines CODEC_PROVIDERS as:
{code}
private static final ServiceLoader<CompressionCodec> CODEC_PROVIDERS =
    ServiceLoader.load(CompressionCodec.class);
{code}
but this is a lazy collection which is thread-unsafe to iterate. We either need to synchronize when we iterate over it, or we need to materialize it at class-loading time by copying it into a non-lazy collection (see the sketch after this record).
Data raceBugClosedMajorFixedTodd LipconTodd LipconTodd Lipcon5/17/12 1:5610/11/12 18:455/17/12 5:280.1
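A minimal sketch of the materialize-at-class-load option suggested in HADOOP-8406 above: copy the lazy ServiceLoader into an immutable list once, inside a static initializer, so later readers only iterate the snapshot. The Codec interface below is a placeholder for CompressionCodec so the sketch stays self-contained:
{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.ServiceLoader;

public class CodecProvidersSketch {
  /** Placeholder for the real service interface (e.g. CompressionCodec). */
  public interface Codec {}

  // ServiceLoader iteration is lazy and not thread-safe, so it is iterated exactly
  // once here, while the class is being initialized.
  private static final List<Codec> CODEC_PROVIDERS;
  static {
    List<Codec> found = new ArrayList<>();
    for (Codec codec : ServiceLoader.load(Codec.class)) {
      found.add(codec);
    }
    CODEC_PROVIDERS = Collections.unmodifiableList(found);
  }

  public static List<Codec> getProviders() {
    return CODEC_PROVIDERS;
  }

  public static void main(String[] args) {
    System.out.println("providers found: " + getProviders().size());
  }
}
{code}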
32
Hadoop CommonHADOOP-8325Add a ShutdownHookManager to be used by different components instead of the JVM shutdownhookFileSystem adds a JVM shutdown hook when a filesystem instance is cached.

MRAppMaster also uses a JVM shutdown hook; among other things, the MRAppMaster JVM shutdown hook is used to ensure state is written to HDFS.

This creates a race condition because each JVM shutdown hook is a separate thread, and if there are multiple JVM shutdown hooks there is no assurance of the order of execution; they could even run in parallel.


Data raceBugClosedCriticalFixedAlejandro AbdelnurAlejandro AbdelnurAlejandro Abdelnur4/27/12 19:145/2/13 3:294/30/12 20:223.0
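A minimal sketch of the idea behind HADOOP-8325 above: register one real JVM shutdown hook and run the components' hooks inside it in a deterministic priority order. The class and method names are illustrative, not the actual ShutdownHookManager API:
{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShutdownHookManagerSketch {
  private static final ShutdownHookManagerSketch INSTANCE = new ShutdownHookManagerSketch();
  private final List<PrioritizedHook> hooks = new ArrayList<>();

  private static final class PrioritizedHook {
    final Runnable hook; final int priority;
    PrioritizedHook(Runnable hook, int priority) { this.hook = hook; this.priority = priority; }
  }

  private ShutdownHookManagerSketch() {
    // The only real JVM shutdown hook; everything else is ordered inside it.
    Runtime.getRuntime().addShutdownHook(new Thread(this::runHooks));
  }

  public static ShutdownHookManagerSketch get() { return INSTANCE; }

  public synchronized void addShutdownHook(Runnable hook, int priority) {
    hooks.add(new PrioritizedHook(hook, priority));
  }

  private synchronized void runHooks() {
    hooks.sort(Comparator.comparingInt((PrioritizedHook h) -> h.priority).reversed());
    for (PrioritizedHook h : hooks) {
      try { h.hook.run(); } catch (Throwable t) { t.printStackTrace(); }
    }
  }

  public static void main(String[] args) {
    get().addShutdownHook(() -> System.out.println("flush app state"), 10);
    get().addShutdownHook(() -> System.out.println("close filesystems"), 0);
    // At JVM exit the higher-priority hook runs first, deterministically.
  }
}
{code}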
33
Hadoop CommonHADOOP-8245Fix flakiness in TestZKFailoverControllerWhen I loop TestZKFailoverController, I occasionally see two types of failures:
1) the ZK JMXEnv issue (ZOOKEEPER-1438)
2) TestZKFailoverController.testZooKeeperFailure fails with a timeout

This JIRA is for fixes for these issues.
Data raceBugResolvedMinorFixedTodd LipconTodd LipconTodd Lipcon4/4/12 4:044/4/12 20:214/4/12 20:210.7
34
Hadoop CommonHADOOP-8220ZKFailoverController doesn't handle failure to become active correctlyThe ZKFC doesn't properly handle the case where the monitored service fails to become active. Currently, it catches the exception and logs a warning, but then continues on, after calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance while handling that same request. There is a test case, but the test case doesn't ensure that the node that had the failure is later able to recover properly.Order violationBugResolvedCriticalFixedTodd LipconTodd LipconTodd Lipcon3/27/12 5:296/29/15 13:153/30/12 21:243/30/12 21:243.7
35
Hadoop CommonHADOOP-815Investigate and fix the extremely large memory-footprint of JobTrackerThe JobTracker's memory footprint seems excessively large, especially when many jobs are submitted.

Here is the 'top' output of a JobTracker which has scheduled ~1k jobs thus far:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31877 arunc 19 0 2362m 261m 13m S 14.0 12.9 24:48.08 java

Clearly VIRTual memory of 2364Mb v/s 261Mb of RESident memory is symptomatic of this issue...
Data raceBugClosedMajorFixedArun C MurthyArun C MurthyArun C Murthy12/12/06 6:195/2/13 3:291/8/07 19:2527.5
36
Hadoop CommonHADOOP-814Increase dfs scalability by optimizing locking on namenode.The current dfs namenode encounters locking bottlenecks when the number of datanodes is large. The namenode uses a single global lock to protect access to data structures. One key area is heartbeat processing. The lower the cost of processing a heartbeat, the more nodes HDFS can support. A simple change to this current locking model can increase the scalability. Here are the details:

Case 1: Currently we have three locks: the global lock (on FSNamesystem), the heartbeat lock, and the datanodeMap lock. The following function is called when a heartbeat is received by the Namenode:

{code}
public synchronized FSNamesystem.gotHeartbeat() {   ........ (A)
  synchronized (heartbeat) {                         ........ (B)
    synchronized (datanodeMap) {                     ......... (C)
      ...
    }
  }
}
{code}

In the above piece of code, statement (A) acquires the global-FSNamesystem-lock. This synchronization can be safely removed (remove updateStats too). This means that a heartbeat from the datanode can be processed without holding the FSnamesystem-global-lock.

Case 2: A thread called heartbeatCheck periodically traverses all known Datanodes to determine if any of them has timed out. It is of the following form:

{code}
void FSNamesystem.heartbeatCheck() {
  synchronized (this) {           ........... (D)
    synchronized (heartbeats) {   ............. (E)
      ...
    }
  }
}
{code}

This thread acquires the global FSNamesystem lock in statement (D). Statement (D) can be removed; instead, the loop can check whether any nodes are dead, and only if a dead node is found does it acquire the FSNamesystem global lock.

It is possible that fixing the above two cases will cause HDFS to scale to higher number of nodes.

Atomicity violation BugClosedMajorFixeddhruba borthakurdhruba borthakurdhruba borthakur12/12/06 5:077/8/09 17:4212/18/06 22:406.7
37
Hadoop CommonHADOOP-8110TestViewFsTrash occasionally fails{noformat}
junit.framework.AssertionFailedError: -expunge failed expected:<0> but was:<1>
at junit.framework.Assert.fail(Assert.java:47)
at junit.framework.Assert.failNotEquals(Assert.java:283)
at junit.framework.Assert.assertEquals(Assert.java:64)
at junit.framework.Assert.assertEquals(Assert.java:195)
at org.apache.hadoop.fs.TestTrash.trashShell(TestTrash.java:322)
at org.apache.hadoop.fs.viewfs.TestViewFsTrash.testTrash(TestViewFsTrash.java:73)
...
{noformat}
There are quite a few TestViewFsTrash failures recently. E.g. [build #624 for trunk|https://builds.apache.org/job/PreCommit-HADOOP-Build/624//testReport/org.apache.hadoop.fs.viewfs/TestViewFsTrash/testTrash/] and [build #2 for 0.23-PB|https://builds.apache.org/view/G-L/view/Hadoop/job/Hadoop-Common-0.23-PB-Build/2/testReport/junit/org.apache.hadoop.fs.viewfs/TestViewFsTrash/testTrash/].
Data raceBugClosedMajorFixedJason LoweTsz Wo Nicholas SzeTsz Wo Nicholas Sze2/24/12 18:4910/11/12 18:457/3/12 21:31130.1
38
Hadoop CommonHADOOP-8050Deadlock in metricsThe metrics serving thread and the periodic snapshot thread can deadlock.
It happened a few times on one of the namenodes we have. When it happens, RPC works but the web ui and hftp stop working. I haven't looked at trunk too closely, but it might happen there too.
DeadlockBugClosedMajorFixedKihwal LeeKihwal LeeKihwal Lee2/9/12 20:2911/14/13 19:452/22/12 16:1312.8
39
Hadoop CommonHADOOP-7964Deadlock in class init.After HADOOP-7808, client-side commands hang occasionally. There are cyclic dependencies in NetUtils and SecurityUtil class initialization. Upon an initial look at the stack trace, two threads deadlock when they hit either of the class inits at the same time.DeadlockBugClosedBlockerFixedDaryn SharpKihwal LeeKihwal Lee1/7/12 0:051/31/12 1:561/24/12 1:3517.1
40
Hadoop CommonHADOOP-791deadlock issue in tasktracker.The stack trace:
"main":
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.createStatus(TaskTracker.java:880)
- waiting to lock <0xea101658> (a org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
at org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:489)
- locked <0x75505f00> (a org.apache.hadoop.mapred.TaskTracker)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:442)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:720)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:1374)
"taskCleanup":
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.cleanup(TaskTracker.java:1072)
- waiting to lock <0x75505f00> (a org.apache.hadoop.mapred.TaskTracker)
at org.apache.hadoop.mapred.TaskTracker$TaskInProgress.jobHasFinished(TaskTracker.java:1013)
- locked <0xea101658> (a org.apache.hadoop.mapred.TaskTracker$TaskInProgress)
at org.apache.hadoop.mapred.TaskTracker$1.run(TaskTracker.java:144)
at java.lang.Thread.run(Thread.java:595)
Found 1 deadlock.

The jobHasFinished method and transmitHeartBeat lock the tasktracker and the tip in different orders. Also, before emitting the heartbeat we should be updating the status and removing entries from runningTasks. Currently this is done after the heartbeat.
DeadlockBugClosedMajorFixedMahadev konarMahadev konarMahadev konar12/7/06 4:377/8/09 17:5212/15/06 22:128.7
41
Hadoop CommonHADOOP-7888TestFailoverProxy fails intermittently on trunkTestFailoverProxy can fail intermittently with the failures occurring in testConcurrentMethodFailures(). The test has a race condition where the two threads may be sequentially invoking the unreliable interface rather than concurrently. Currently the proxy provider's getProxy() method contains the thread synchronization to enforce a concurrent invocation, but examining the source to RetryInvocationHandler.invoke() shows that the call to getProxy() during failover is too late to enforce a truly concurrent invocation.

For this particular test, one thread could race ahead and block on the CountDownLatch in getProxy() before the other thread even enters RetryInvocationHandler.invoke(). If that happens the second thread will cache the newly updated value for proxyProviderFailoverCount, since the failover has mostly been processed by the original thread. Therefore the second thread ends up assuming no other thread is present, performs a failover, and the test fails because two failovers occurred instead of one.
Order violationBugClosedMajorFixedJason LoweJason LoweJason Lowe12/7/11 1:545/23/12 21:1512/8/11 1:211.0
42
Hadoop CommonHADOOP-7854UGI getCurrentUser is not synchronizedSporadic {{ConcurrentModificationExceptions}} are originating from {{UGI.getCurrentUser}} when it needs to create a new instance. The problem was specifically observed in a JT under heavy load when a post-job cleanup is accessing the UGI while a new job is being processed.Data raceBugClosedCriticalFixedDaryn SharpDaryn SharpDaryn Sharp11/23/11 18:1312/28/11 10:0312/1/11 1:227.3
43
Hadoop CommonHADOOP-7822Hadoop startup script has a race condition : this causes failures in datanodes status and stop commandsThe symptoms are the following:

a) start-all.sh is able to start both hadoop dfs and map-reduce processes, assuming same grid nodes are used for dfs and map-reduce
b) stop-all.sh stops map-reduce but fails to stop dfs processes (datanode tasks on grid nodes)
Instead, the warning message 'no datanode to stop' is seen for all data nodes.
c) The 'pid' files for datanode processes do not exist; therefore the only way to stop datanode processes is to manually execute kill commands.


The root cause of the issue appears to be in hadoop startup scripts. start-all.sh is really two parts:

1. start-dfs.sh : Start namenode and datanodes

2. start-mapred.sh: Jobtracker and task trackers.

In this case, running start-dfs.sh did as expected and created the pid files for the different datanodes. However, the start-mapred.sh script ended up forcing another rsync from master to slaves, effectively wiping out the pid files stored under the "pid" directory.
Data raceBugResolvedMajorFixedUnassignedRahul JainRahul Jain11/15/11 0:043/20/15 0:483/20/15 0:481221.0
44
Hadoop CommonHADOOP-7633log4j.properties should be added to the hadoop conf on deployCurrently the log4j properties are not present in the hadoop conf dir. We should add them so that log rotation happens appropriately, and also define the other logs that hadoop can generate, for example the audit and auth logs as well as the mapred summary logs.Data raceBugClosedMajorFixedEric YangArpit GuptaArpit Gupta9/13/11 21:195/11/12 15:529/20/11 23:447.1
45
Hadoop CommonHADOOP-7529Possible deadlock in metrics2Lock cycle detected by jcarder between MetricsSystemImpl and DefaultMetricsSystemDeadlockBugClosedCriticalFixedLuke LuTodd LipconTodd Lipcon8/8/11 23:3511/15/11 0:508/12/11 17:593.8
46
Hadoop CommonHADOOP-752Possible locking issues in HDFS NamenodeI have been investigating the cause of random Namenode memory corruptions/memory overflows, etc. Please comment.

1. The functions datanodeReport() and DFSNodesStatus() do not acquire the global lock.
This can race with another thread invoking registerDatanode(). registerDatanode()
can remove a datanode (thru wipeDatanode()) while the datanodeReport thread is
traversing the list of datanodes. This can cause exceptions to occur.

2. The blocksMap is protected by the global lock. The setReplication() call does not acquire
the global lock when it calls processOverReplicatedBlock(). This can cause corruption in blocksMap.


Atomicity violation BugClosedMajorFixeddhruba borthakurdhruba borthakurdhruba borthakur11/28/06 5:517/8/09 17:4212/7/06 18:469.5
47
Hadoop CommonHADOOP-750race condition on stalled map output fetchesI've seen reduces getting killed because of a race condition in the ReduceTaskRunner. In the logs it looks like:

2006-11-27 08:40:44,795 WARN org.apache.hadoop.mapred.TaskRunner: Map output copy stalled on http://kry2296.inktomisearch.com:7030/mapOutput?map=task_0001_m_015626_0
...
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Need 52 map output(s)
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Got 39 known map output location(s); scheduling...
2006-11-27 09:16:41,361 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 Scheduled 0 of 39 known outputs (0 slow hosts and 39 dup hosts)
...
2006-11-27 09:16:47,071 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0 0.3328575% reduce > copy (28679 of 28720 at 0.76 MB/s) >
...
2006-11-27 09:16:47,338 INFO org.apache.hadoop.mapred.TaskRunner: task_0001_r_000658_0 done copying task_0001_m_015462_0 output from node1
...
2006-11-27 09:36:51,398 INFO org.apache.hadoop.mapred.TaskTracker: task_0001_r_000658_0: Task failed to report status for 1204 seconds. Killing.

Basically, the handling of the stall has a race condition that leaves the fetcher in a bad state. At the end of the fetch, all of the tasks finish and their results never get handled. When the thread times out, all of the map output copiers are waiting for things to fetch and the prepare thread is waiting for results.
SuspensionBugClosedMajorFixedOwen O'MalleyOwen O'MalleyOwen O'Malley11/28/06 0:307/8/09 17:5212/1/06 22:273.9
48
Hadoop CommonHADOOP-737TaskTracker's job cleanup loop should check for finished job before deleting local directoriesTaskTracker uses jobClient.pollForTaskWithClosedJob() to find tasks which should be closed. This mechanism doesn't pass the information on whether the job is really finished or the task is being killed for some other reason (a speculative instance succeeded). Since the TaskTracker doesn't know this state, it assumes the job is finished and deletes the local job dir, causing any subsequent tasks on the same task tracker for the same job to fail with a job.xml-not-found exception, as reported in HADOOP-546 and possibly in HADOOP-543. This causes my patch for HADOOP-76 to fail for a large number of reduce tasks in some cases.

The same causes extra exceptions in the logs while a job is being killed: the first task that gets closed will delete the local directories, and any other tasks (if any) that are about to be launched will throw this exception. In this case it is less significant, as the job is killed anyway and only the logs get extra exceptions.

Possible solutions:
1. Add an extra method in InterTrackerProtocol for checking the job status before deleting the local directory.
2. Set TaskTracker.RunningJob.localized to false once the local directory is deleted so that new tasks don't look for it there.

There is clearly a race condition in this, and the logs may still get the exception during shutdown, but in normal cases it would work.

Comments ?
Data raceBugClosedCriticalFixedArun C MurthySanjay DahiyaSanjay Dahiya11/20/06 12:465/2/13 3:2912/12/06 6:0521.7
49
Hadoop CommonHADOOP-723Race condition exists in the method MapOutputLocation.getFileThere seems to be a race condition in the way the Reduces copy the map output files from the Maps. If a copier is blocked in the connect method (in the beginning of the method MapOutputLocation.getFile) to a Jetty on a Map, and the MapCopyLeaseChecker detects that the copier was idle for too long, it will go ahead and issue an interrupt (read 'kill') to this thread and create a new Copier thread. However, the copier, currently blocked trying to connect to Jetty on a Map, doesn't actually get killed until the connect timeout expires, and as soon as the connect comes out (with an IOException), it will delete the map output file, which actually could have been (successfully) created by the new Copier thread. This leads to the Sort phase for that reducer failing with a FileNotFoundException.
One simple way to fix this is to not delete the file if the file was not created within this getFile method.
Data raceBugClosedMajorFixedOwen O'MalleyDevaraj DasDevaraj Das11/15/06 16:457/8/09 17:5211/20/06 23:355.3
50
Hadoop CommonHADOOP-7183WritableComparator.get should not cache comparator objectsHADOOP-6881 modified WritableComparator.get such that the constructed WritableComparator gets saved back into the static map. This is fine for stateless comparators, but some comparators have per-instance state, and thus this becomes thread-unsafe and causes errors in the shuffle where multiple threads are doing comparisons. An example of a Comparator with per-instance state is WritableComparator itself.Data raceBugClosedBlockerFixedTom WhiteTodd LipconTodd Lipcon3/11/11 18:2212/12/11 6:195/6/11 6:0955.5
51
Hadoop CommonHADOOP-7055Update of commons logging libraries causes EventCounter to count logging events incorrectlyData raceBugClosedMajorFixedJingguo YaoJingguo YaoJingguo Yao11/30/10 8:5311/15/11 0:505/12/11 19:41163.5
52
Hadoop CommonHADOOP-6939Inconsistent lock ordering in AbstractDelegationTokenSecretManagerAbstractDelegationTokenSecretManager.startThreads() is synchronized, which calls updateCurrentKey(), which calls logUpdateMasterKey. logUpdateMasterKey's implementation for HDFS's manager calls namesystem.logUpdateMasterKey() which is synchronized. Thus the lock order is ADTSM -> FSN. In FSN.saveNamespace, though, it calls DTSM.saveSecretManagerState(), so the lock order is FSN -> ADTSM.

I don't think this deadlock occurs in practice since saveNamespace won't occur until after the ADTSM has started its threads, but should be fixed anyway.
Order violationBugClosedMinorFixedTodd LipconTodd LipconTodd Lipcon9/3/10 23:4811/15/11 0:5012/3/10 23:0291.0
53
Hadoop CommonHADOOP-6762exception while doing RPC I/O closes channelIf a single process creates two unique fileSystems to the same NN using FileSystem.newInstance(), and one of them issues a close(), the leasechecker thread is interrupted. This interrupt races with the rpc namenode.renew() and can cause a ClosedByInterruptException. This closes the underlying channel, and the other filesystem, sharing the connection, will get errors.DeadlockBugClosedCriticalFixedsam rashsam rashsam rash5/12/10 17:249/4/14 0:1112/10/12 21:26943.2
54
Hadoop CommonHADOOP-6691TestFileSystemCaching sometimes hangTestFileSystemCaching#testCacheEnabledWithInitializeForeverFS() sometimes hangs if InitializeForeverFileSystem initializes first.Order violationBugClosedMajorFixedHairong KuangHairong KuangHairong Kuang4/8/10 19:468/24/10 21:424/8/10 22:340.1
55
Hadoop CommonHADOOP-6640FileSystem.get() does RPC retries within a static synchronized blockIf FileSystem.get() is used in a multithreaded environment, and one get() locks because the NN URI is too slow or not responding and retries are in progress, all other get() calls (for different users and NNs) are blocked.

The synchronized block is in the static instance of the Cache inner class (a per-key alternative is sketched after this record).
SuspensionBugClosedCriticalFixedHairong KuangAlejandro AbdelnurAlejandro Abdelnur3/17/10 23:068/24/10 21:424/1/10 22:3615.0
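A minimal sketch of one way to avoid the global lock HADOOP-6640 above complains about: compute cache entries per key, so a slow or retrying initialization for one NN URI does not block get() calls for other keys. This is illustrative only, not the actual FileSystem.Cache change:
{code}
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class FsCacheSketch<K, V> {
  private final ConcurrentHashMap<K, V> cache = new ConcurrentHashMap<>();

  /**
   * Create-or-get without one global lock: a slow factory call for one key
   * does not block callers asking for a different key.
   */
  public V get(K key, Function<K, V> factory) {
    return cache.computeIfAbsent(key, factory);
  }

  public static void main(String[] args) {
    FsCacheSketch<String, String> cache = new FsCacheSketch<>();
    String fs = cache.get("hdfs://nn1", uri -> "filesystem for " + uri);
    System.out.println(fs);
  }
}
{code}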
56
Hadoop CommonHADOOP-6609Deadlock in DFSClient#getBlockLocations even with the security disabledHere is the stack trace:
"IPC Client (47) connection to XX" daemon
prio=10 tid=0x00002aaae0369c00 nid=0x655b waiting for monitor entry [0x000000004181f000..0x000000004181fb80]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:210)
- waiting to lock <0x00002aaab3eaee50> (a org.apache.hadoop.io.DataOutputBuffer)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:179)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:638)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:573)

"IPC Client (47) connection to /0.0.0.0:50030 from job_201002262308_0007"
daemon prio=10 tid=0x00002aaae0272800 nid=0x6556 waiting for monitor entry [0x000000004131a000..0x000000004131ad00]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:210)
- waiting to lock <0x00002aaab3eaee50> (a org.apache.hadoop.io.DataOutputBuffer)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:179)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:66)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:638)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:573)

"main" prio=10 tid=0x0000000046c17800 nid=0x6544 in Object.wait() [0x0000000040207000..0x0000000040209ec0]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaacee6bc38> (a org.apache.hadoop.ipc.Client$Call)
at java.lang.Object.wait(Object.java:485)
at org.apache.hadoop.ipc.Client.call(Client.java:854) - locked <0x00002aaacee6bc38> (a org.apache.hadoop.ipc.Client$Call)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:223)
at $Proxy2.getBlockLocations(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy2.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:333)
at org.apache.hadoop.hdfs.DFSClient.access$2(DFSClient.java:330)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.getBlockAt(DFSClient.java:1606)
- locked <0x00002aaacecb8258> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1704)
- locked <0x00002aaacecb8258> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1856)
- locked <0x00002aaacecb8258> (a org.apache.hadoop.hdfs.DFSClient$DFSInputStream)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:211)
- locked <0x00002aaab3eaee50> (a org.apache.hadoop.io.DataOutputBuffer)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:203)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:90)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:1)
at org.apache.hadoop.mapred.MapTask.getSplitDetails(MapTask.java:341)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:357)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:317)
at org.apache.hadoop.mapred.Child$4.run(Child.java:211)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:700)
at org.apache.hadoop.mapred.Child.main(Child.java:205)
DeadlockBugClosedMajorFixedOwen O'MalleyHairong KuangHairong Kuang3/3/10 0:168/24/10 21:423/4/10 7:211.3
57
Hadoop CommonHADOOP-6572RPC responses may be out-of-order with respect to SASLSASL enforces its own message ordering. When the RPC server sends its responses back, response A may be wrapped by SASL before response B but put on the response queue after response B. This results in the RPC client receiving wrapped response B ahead of A. When the received messages are unwrapped by SASL, SASL complains that the messages are out of order.Order violationBugClosedMajorFixedKan ZhangKan ZhangKan Zhang2/17/10 17:178/24/10 21:422/19/10 9:051.7
58
Hadoop CommonHADOOP-6508Incorrect values for metrics with CompositeContextIn our clusters, when we use CompositeContext with two contexts, the second context gets wrong values.
This problem is consistent on 500-node (and larger) clusters.
Data raceBugClosedMajorFixedLuke LuAmareshwari SriramadasuAmareshwari Sriramadasu1/25/10 3:5911/15/11 0:505/23/11 20:01483.7
59
Hadoop CommonHADOOP-6498IPC client bug may cause rpc call hangI can reproduce an rpc call hang bug when the connection thread of the ipc client receives a response for an outstanding call.

The stacks when the hang occurs (TaskTracker):

Waiting on org.apache.hadoop.ipc.Client$Call@1c3cbb4b
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
org.apache.hadoop.ipc.Client.call(Client.java:691)
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
org.apache.hadoop.mapred.$Proxy4.heartbeat(Unknown Source)
org.apache.hadoop.mapred.TaskTracker.transmitHeartBeat(TaskTracker.java:1250)
org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:1082)
org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1785)
org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2796)
SuspensionBugResolvedBlockerFixedRuyue MaRuyue MaRuyue Ma1/18/10 16:227/22/14 20:191/26/10 23:058.3
60
Hadoop CommonHADOOP-6386NameNode's HttpServer can't instantiate InetSocketAddress: IllegalArgumentException is thrownIn one test execution, the following exception was thrown:
{noformat}
Error Message

port out of range:-1

Stacktrace
java.lang.IllegalArgumentException: port out of range:-1
at java.net.InetSocketAddress.<init>(InetSocketAddress.java:118)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:371)
at org.apache.hadoop.hdfs.server.namenode.NameNode.activate(NameNode.java:313)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:304)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:410)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:404)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1211)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:287)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:131)
at org.apache.hadoop.hdfs.server.namenode.TestEditLog.testEditLog(TestEditLog.java:92)
{noformat}
Data raceBugClosedBlockerFixedKonstantin BoudnikKonstantin BoudnikKonstantin Boudnik11/13/09 16:158/24/10 21:402/1/10 3:1879.5
61
Hadoop CommonHADOOP-638TaskTracker missing synchronization around tasks variable accessTaskTracker has one access to the tasks variable (line 449) that is not synchronized. It should be.Data raceBugClosedMajorFixedNigel DaleyNigel DaleyNigel Daley10/25/06 22:347/8/09 17:5110/26/06 21:311.0
62
Hadoop CommonHADOOP-627MiniMRCluster missing synchronizationorg.apache.hadoop.mapred.MiniMRCluster$TaskTrackerRunner contains (at least) 2 instance variables that are read by another thread: isInitialized and isDead. These should be declared volatile or proper synchronization should be used for their access.
Data raceBugClosedMajorFixedNigel DaleyNigel DaleyNigel Daley10/24/06 2:027/8/09 17:5110/24/06 22:530.9
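As the report above suggests, the simplest repair for flags that one thread writes and another reads is to declare them volatile (or guard them with a common lock). A minimal sketch, using the field names mentioned in the report but otherwise invented code:
{code}
class TaskTrackerRunnerSketch {
  private volatile boolean isInitialized = false;   // written by the runner thread
  private volatile boolean isDead = false;          // read by a monitoring thread

  void markInitialized() { isInitialized = true; }
  boolean initialized()  { return isInitialized; }  // reader is guaranteed to see the write

  void markDead() { isDead = true; }
  boolean dead()  { return isDead; }
}
{code}
Volatile guarantees visibility of the latest write but not atomicity of compound updates; if the two flags were ever updated together, a shared lock would be the safer choice.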
63
Hadoop CommonHADOOP-6269Missing synchronization for defaultResources in Configuration.addResourceConfiguration.defaultResources is a simple ArrayList. In two places in Configuration it is accessed without appropriate synchronization, which we've seen to occasionally result in ConcurrentModificationExceptions.Data raceBugResolvedMajorFixedSreekanth RamakrishnanTodd LipconTodd Lipcon9/18/09 0:077/2/10 5:4611/21/09 4:2364.2
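A ConcurrentModificationException like the one described above typically means one thread iterates a plain ArrayList while another mutates it. A hedged sketch of two standard remedies (illustrative names, not the actual HADOOP-6269 patch):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class DefaultResourcesSketch {
  // Remedy 1: guard every access (adds and iteration) with the same lock.
  private final List<String> defaultResources = new ArrayList<String>();

  synchronized void addResource(String name) {
    defaultResources.add(name);
  }

  synchronized List<String> snapshotResources() {
    return new ArrayList<String>(defaultResources);   // callers iterate the copy outside the lock
  }

  // Remedy 2: a copy-on-write list, appropriate when reads vastly outnumber writes.
  private final List<String> cowResources = new CopyOnWriteArrayList<String>();
}
{code}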
64
Hadoop CommonHADOOP-6221RPC Client operations cannot be interruptedRPC.waitForProxy swallows any attempt to interrupt it while waiting for a proxy; this makes it hard to shut down a service that you are starting, because you have to wait for the timeouts.

There are only 4-5 places in the code that use either of the two overloaded methods, so removing the catch and changing the signature should not be too painful, unless anyone is using the method outside the Hadoop codebase.
SuspensionBugClosedMinorFixedSteve LoughranSteve LoughranSteve Loughran8/27/09 13:0510/2/09 13:051/26/15 22:061978.4
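A hedged sketch of the interrupt-friendly behaviour the report asks for (the real RPC.waitForProxy signature differs; ProxyFactory here is an invented helper): rather than catching and swallowing InterruptedException inside the retry loop, let it propagate so a service being shut down stops waiting immediately.
{code}
import java.io.IOException;
import java.net.ConnectException;

interface ProxyFactory<T> {
  T connectOnce() throws IOException;
}

final class InterruptibleWait {
  static <T> T waitForProxy(ProxyFactory<T> factory, long timeoutMs)
      throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      try {
        return factory.connectOnce();
      } catch (ConnectException e) {
        if (System.currentTimeMillis() >= deadline) {
          throw e;                  // give up after the timeout
        }
        Thread.sleep(1000);         // InterruptedException now escapes to the caller
      }
    }
  }
}
{code}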
65
Hadoop CommonHADOOP-6220HttpServer wraps InterruptedExceptions in IOExceptions if interrupted during startupFollowing some discussion on mapred-dev, we should keep an eye on the fact that Jetty uses sleeps when starting up; Jetty can be a big part of the delay in bringing up a node. When interrupted, the exception is wrapped in an IOException; the root cause is still there, just hidden.

If we want callers to distinguish InterruptedExceptions from IOEs, then this exception should be extracted. Some helper method to start an HTTP daemon could do this: catch the IOE, and if there is a nested InterruptedException, rethrow it; otherwise rethrow the original IOE.
Data raceBugResolvedMinorFixedSteve LoughranSteve LoughranSteve Loughran8/27/09 13:013/28/12 10:189/28/11 21:38762.4
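A minimal sketch of the helper method suggested in the report (the method name and the HttpServerLike interface are invented for illustration): start the daemon, and if the IOException merely wraps an interruption, rethrow the interruption instead of hiding it.
{code}
import java.io.IOException;

interface HttpServerLike {
  void start() throws IOException;
}

final class HttpStartHelper {
  static void startHttpServer(HttpServerLike server)
      throws IOException, InterruptedException {
    try {
      server.start();
    } catch (IOException e) {
      Throwable cause = e.getCause();
      if (cause instanceof InterruptedException) {
        throw (InterruptedException) cause;   // surface the interruption to the caller
      }
      throw e;                                // a genuine I/O failure
    }
  }
}
{code}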
66
Hadoop CommonHADOOP-600Race condition in JobTracker updating the task tracker's status while declaring it lostThere was a case where the JobTracker lost track of a set of tasks that were on a task tracker. It appears to be a race condition because the ExpireTrackers thread doesn't lock the JobTracker while updating the state. The fix would be to build a list of dead task trackers and then lock the job tracker while updating their status.Data raceBugClosedMajorFixedArun C MurthyOwen O'MalleyOwen O'Malley10/12/06 17:505/2/13 3:291/8/07 19:4388.1
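The fix sketched in the report above is a common pattern: gather the expensive decision (which trackers are dead) without the global lock, then apply all mutations while holding it. A rough illustration with invented class names, not the actual JobTracker code:
{code}
import java.util.ArrayList;
import java.util.List;

class TrackerExpirySketch {
  private final Object jobTrackerLock = new Object();

  void expireLostTrackers(List<TrackerStatus> trackers, long now, long timeoutMs) {
    // Phase 1: decide which trackers are lost without holding the global lock.
    List<TrackerStatus> dead = new ArrayList<TrackerStatus>();
    for (TrackerStatus t : trackers) {
      if (now - t.lastHeartbeat > timeoutMs) {
        dead.add(t);
      }
    }
    // Phase 2: update their status under the lock, so no reader observes a
    // half-updated tracker/task mapping.
    synchronized (jobTrackerLock) {
      for (TrackerStatus t : dead) {
        t.markLost();
      }
    }
  }

  static class TrackerStatus {
    volatile long lastHeartbeat;
    void markLost() { /* update bookkeeping for this tracker's tasks */ }
  }
}
{code}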
67
Hadoop CommonHADOOP-5921JobTracker does not come up because of NotReplicatedYetExceptionSometimes (on a big cluster) the JobTracker does not come up because the info file could not be replicated on DFS.LivelockBugResolvedMajorFixedAmar KamatAmareshwari SriramadasuAmareshwari Sriramadasu5/27/09 10:507/8/09 17:536/15/09 7:1718.9
68
Hadoop CommonHADOOP-5859FindBugs : fix "wait() or sleep() with locks held" warnings in hdfsThis JIRA fixes the following warnings:

SWL org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal() calls Thread.sleep() with a lock held
TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal()
TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal()
TLW wait() with two locks held in org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(byte[], int, int, byte[])
Atomicity violation BugClosedMajorFixedKan ZhangKan ZhangKan Zhang5/16/09 1:388/24/10 21:376/4/09 19:1519.7
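The FindBugs warnings above (SWL/TLW) flag sleeping or waiting while monitors are held, which blocks every other thread that needs the same lock for the whole pause. A generic before/after sketch, not the literal DFSClient code:
{code}
class FlushSketch {
  private final Object lock = new Object();
  private boolean flushed = false;

  // Problematic: the monitor is held for the entire sleep (FindBugs SWL).
  void awaitFlushWithSleep() throws InterruptedException {
    synchronized (lock) {
      while (!flushed) {
        Thread.sleep(100);
      }
    }
  }

  // Preferred: wait() releases the monitor until another thread notifies us.
  void awaitFlushWithWait() throws InterruptedException {
    synchronized (lock) {
      while (!flushed) {
        lock.wait(100);
      }
    }
  }

  void markFlushed() {
    synchronized (lock) {
      flushed = true;
      lock.notifyAll();
    }
  }
}
{code}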
69
Hadoop CommonHADOOP-5746Errors encountered in MROutputThread after the last map/reduce call can go undetectedThe framework map/reduce bridge methods make a check at the beginning of the respective methods whether _MROutputThread_ encountered an exception while writing keys/values that the streaming process emitted. However, if the exception happens in _MROutputThread_ after the last call to the map/reduce method, the exception goes undetected. An example of such an exception is an exception from the _DFSClient_ that fails to write to a file on the HDFS.Order violationBugResolvedMajorFixedAmar KamatDevaraj DasDevaraj Das4/27/09 4:397/8/09 18:056/4/09 13:4538.4
70
Hadoop CommonHADOOP-5644Namenode is stuck in safe modeRestarting datanodes while a client is writing to them can cause the namenode to get stuck in safe mode.Data raceBugClosedMajorFixedSuresh SrinivasSuresh SrinivasSuresh Srinivas4/9/09 0:507/8/09 17:434/15/09 7:166.3
71
Hadoop CommonHADOOP-5585FileSystem statistic counters are too high when JVM reuse is enabled.When JVM reuse is enabled, the FileSystem.Statistics are not cleared between tasks. That means the second task gets credit for its own reads and writes as well as the first task's. The third gets credit for all three tasks' reads and writes.Data raceBugClosedBlockerFixedOwen O'MalleyOwen O'MalleyOwen O'Malley3/26/09 23:357/8/09 17:534/8/09 6:2312.3
72
Hadoop CommonHADOOP-5548Observed negative running maps on the job trackerWe saw the following in both the web UI and the CLI tools:

{noformat}
Cluster Summary (Heap Size is 11.7 GB/13.37 GB)

Maps | Reduces | Total Submissions | Nodes | Map Task Capacity | Reduce Task Capacity | Avg. Tasks/Node | Blacklisted Nodes
-971 | 0 | 133 | 434 | 1736 | 1736 | 8.00 | 0
{noformat}

Data raceBugClosedBlockerFixedAmareshwari SriramadasuOwen O'MalleyOwen O'Malley3/20/09 20:254/23/09 20:184/7/09 13:1717.7
73
Hadoop CommonHADOOP-5534Deadlock triggered by the FairScheduler's servlet due to changes from HADOOP-5214.DeadlockBugClosedBlockerFixedrahul k singhVinod Kumar VavilapalliVinod Kumar Vavilapalli3/19/09 8:267/8/09 17:413/19/09 13:220.2
74
Hadoop CommonHADOOP-5490TestParallelInitialization failed on NoSuchElementExceptionjava.util.NoSuchElementException
at java.util.AbstractList$Itr.next(AbstractList.java:350)
at java.util.Collections.sort(Collections.java:162)
at org.apache.hadoop.mapred.EagerTaskInitializationListener.resortInitQueue(EagerTaskInitializationListener.java:162)
at org.apache.hadoop.mapred.EagerTaskInitializationListener.jobAdded(EagerTaskInitializationListener.java:137)
at org.apache.hadoop.mapred.TestParallelInitialization$FakeTaskTrackerManager.submitJob(TestParallelInitialization.java:142)
at org.apache.hadoop.mapred.TestParallelInitialization.testParallelInitJobs(TestParallelInitialization.java:185)
Atomicity violation BugClosedBlockerFixedJothi PadmanabhanHairong KuangHairong Kuang3/13/09 22:157/8/09 17:533/16/09 8:582.4
75
Hadoop CommonHADOOP-5473Race condition in command-line kill for a taskThe race condition occurs in following sequence of events:
1. User issues a command-line kill for a RUNNING map-task. JT stores the task in tasksToKill mapping.
2. TT reports the task status as SUCCEEDED.
3. JT creates a TaskCompletionEvent as SUCCEEDED. Also sends a killTaskAction.
4. Reducers fail fetching the map output.
5. Finally, the task fails with fetch failures. After HADOOP-4759, the task is left as a FAILED_UNCLEAN task, since it is present in the tasksToKill mapping.
Order violationBugClosedMajorFixedAmareshwari SriramadasuAmareshwari SriramadasuAmareshwari Sriramadasu3/12/09 8:057/8/09 17:533/30/09 13:0418.2
76
Hadoop CommonHADOOP-5465Blocks remain under-replicatedOccasionally we see some blocks remain under-replicated in our production clusters. This is what we observed:
1. Sometimes when increasing the replication factor of a file, some blocks belonging to the file do not reach the new replication factor.
2. When taking a metasave on two different days, some blocks remain in the under-replication queue.
Data raceBugClosedBlockerFixedHairong KuangHairong KuangHairong Kuang3/11/09 23:037/8/09 17:433/13/09 19:591.9
77
Hadoop CommonHADOOP-5439FileSystem.create() with overwrite param specified sometimes takes a long time to return.If a file already exists, it takes a long time for the overwrite create to return.
{code}
fs.create(path_1, true);
{code}
sometimes takes a long time.

Instead, the code:
{code}
if (fs.exists(path_1))
  fs.delete(path_1);
fs.create(path_1);
{code}
works pretty well.
SuspensionBugResolvedMajorFixedUnassignedHe YongqiangHe Yongqiang3/9/09 6:024/9/09 6:027/21/14 22:391960.7
78
Hadoop CommonHADOOP-5412TestInjectionForSimulatedStorage occasionally fails on timeoutOccasionally TestInjectionForSimulatedStorage falls into an infinite loop, waiting for a block to reach its replication factor. The log repeatedly prints the following message:
dfs.TestInjectionForSimulatedStorage (TestInjectionForSimulatedStorage.java:waitForBlockReplication(89)) - Not enough replicas for 2th block blk_6302924909504458109_1001 yet. Expecting 4, got 2.

Order violationBugResolvedMajorFixedHairong KuangHairong KuangHairong Kuang3/5/09 21:353/13/09 15:053/11/09 21:296.0
79
Hadoop CommonHADOOP-5337JobTracker greedily schedules tasks without running tasks to joinThis issue was observed when the JobTracker was restarted 3 times; afterwards, 4 instances of each reduce task were running. It occurs when the cluster is not fully occupied.

In the test case: map/reduce capacity is 200/200 slots respectively, the job profile is 11000 maps and 10 reduces, and speculative execution is off. The JobTracker was restarted 3 times at short intervals of about 5 minutes; after recovery, 40 reduce tasks were running. The task details web page (taskdetails.jsp) showed 4 running attempts of each reduce task.
Data raceBugClosedMajorFixedAmar KamatKaram SinghKaram Singh2/26/09 12:067/8/09 17:534/3/09 13:0236.0
80
Hadoop CommonHADOOP-5311Write pipeline recovery failsA write pipeline recovery fails on the error below:
INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 53006, call recoverBlock(blk_1415000632081498137_954380, false, [Lorg.apache.hadoop.hdfs.protocol.DatanodeInfo;
@4ec82dc6) from XX: error: org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_1415000632081498137_954380 is already commited, storedBlock == null.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4487)
at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:473)
at sun.reflect.GeneratedMethodAccessor27.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
Not ClearBugResolvedBlockerFixeddhruba borthakurHairong KuangHairong Kuang2/23/09 19:427/8/09 17:433/13/09 6:2417.4
81
Hadoop CommonHADOOP-5285JobTracker hangs for long periods of timeOn one of the larger clusters of 2000 nodes, the JT hung quite often, sometimes for 10-15 minutes at a time and once for one and a half hours(!). The stack trace shows that JobInProgress.obtainTaskCleanupTask() is waiting for the lock on a JobInProgress object which JobInProgress.initTasks() holds for a long time while waiting for DFS operations.SuspensionBugClosedBlockerFixedDevaraj DasVinod Kumar VavilapalliVinod Kumar Vavilapalli2/19/09 12:157/8/09 17:532/23/09 7:373.8
82
Hadoop CommonHADOOP-5206All "unprotected*" methods of FSDirectory should synchronize on the root.Synchronization on {{rootDir}} is missing for two (relatively new) methods:
- {{unprotectedSetQuota()}}
- {{unprotectedSetTimes()}}
Data raceBugClosedMajorFixedJakob HomanKonstantin ShvachkoKonstantin Shvachko2/9/09 20:018/24/10 21:352/17/09 18:127.9
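A minimal sketch of the missing synchronization described above; the real FSDirectory methods take more parameters and mutate an inode tree, so only the locking shape is shown and the bodies are placeholders:
{code}
class FSDirectorySketch {
  private final Object rootDir = new Object();   // stand-in for the root directory inode

  void unprotectedSetQuota(String src, long nsQuota, long dsQuota) {
    synchronized (rootDir) {
      // update quota fields on the inode for 'src'
    }
  }

  void unprotectedSetTimes(String src, long mtime, long atime) {
    synchronized (rootDir) {
      // update modification/access times on the inode for 'src'
    }
  }
}
{code}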
83
Hadoop CommonHADOOP-51544-way deadlock in FairShare schedulerThis happened while trying to change the priority of a job from the scheduler servlet.SuspensionBugClosedBlockerFixedMatei ZahariaVinod Kumar VavilapalliVinod Kumar Vavilapalli2/2/09 7:417/8/09 17:412/25/09 15:0523.3
84
Hadoop CommonHADOOP-5146LocalDirAllocator misses files on the local filesystemFor some reason LocalDirAllocator.getLocalPathToRead doesn't find files which are present; extra logging shows:

{noformat}
2009-01-30 06:43:32,312 INFO org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: in ifExists, /grid/2/arunc/mapred-local/taskTracker/archive/xxx.yyy.com/tera/in/_partition.lst exists
2009-01-30 06:43:32,389 WARN org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext: in getLocalPathToRead, taskTracker/archive/xxx.yyy.com/tera/in/_partition.lst doesn't exist
2009-01-30 06:43:32,390 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200901300512_0007_m_000055_0 Child Error
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/archive/xx.yyy.com/tera/in/_partition.lst in any of the configured local directories
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:388)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:172)
{noformat}
Data raceBugClosedBlockerFixedDevaraj DasArun C MurthyArun C Murthy1/30/09 2:377/8/09 17:532/28/09 7:2829.2
85
Hadoop CommonHADOOP-4977Deadlock between reclaimCapacity and assignTasksI was running the latest trunk with the capacity scheduler and saw the JobTracker lock up with the following deadlock reported in jstack:

Found one Java-level deadlock:
=============================
"18107298@qtp0-4":
waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 4 on 54311"
"IPC Server handler 4 on 54311":
waiting to lock monitor 0x0808594c (object 0x5660e518, a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr),
which is held by "reclaimCapacity"
"reclaimCapacity":
waiting to lock monitor 0x08085b40 (object 0x56605100, a org.apache.hadoop.mapred.JobTracker),
which is held by "IPC Server handler 4 on 54311"

Java stack information for the threads listed above:
===================================================
"18107298@qtp0-4":
at org.apache.hadoop.mapred.JobTracker.getClusterStatus(JobTracker.java:2695)
- waiting to lock <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
at org.apache.hadoop.mapred.jobtracker_jsp._jspService(jobtracker_jsp.java:93)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:97)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:324)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
"IPC Server handler 4 on 54311":
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.updateQSIObjects(CapacityTaskScheduler.java:564)
- waiting to lock <0x5660e518> (a org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.assignTasks(CapacityTaskScheduler.java:855)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$1000(CapacityTaskScheduler.java:294)
at org.apache.hadoop.mapred.CapacityTaskScheduler.assignTasks(CapacityTaskScheduler.java:1336)
- locked <0x5660dd20> (a org.apache.hadoop.mapred.CapacityTaskScheduler)
at org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2288)
- locked <0x56605100> (a org.apache.hadoop.mapred.JobTracker)
at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)

Unfortunately, by mistake I didn't select all of the output, so some is missing, but it appears that reclaimCapacity locks the MapSchedulingMgr and then tries to lock the JobTracker, whereas updateQSIObjects, called from assignTasks, holds a lock on the JobTracker (the JobTracker grabs this lock when it calls assignTasks) and then tries to lock the MapSchedulingMgr. The other thread listed there is a Jetty thread for the web interface and isn't part of the circular locking. The solution would be to lock the JobTracker in reclaimCapacity before locking anything else.
DeadlockBugClosedBlockerFixedVivek RatanMatei ZahariaMatei Zaharia1/2/09 23:187/8/09 17:401/15/09 2:5812.2
86
Hadoop CommonHADOOP-4971Block report times from datanodes could converge to same time.Datanode block reports take quite a bit of memory to process at the namenode. After the initial report, DNs pick a random time for subsequent reports to spread this load at the NN. This normally works fine.

Block reports are sent inside the "offerService()" thread in the DN. If for some reason this thread is stuck for a long time (comparable to the block report interval), and the same thing happens on many DNs, all of them get back to the loop at the same time and start sending block reports then, and every hour thereafter, at the same time.

The RPC server and clients in 0.18 can handle this situation fine, but since this is a memory-intensive RPC it led to large GC delays at the NN. We don't know yet why offerService threads seemed to be stuck, but the DN should re-randomize its block report time in such cases.
SuspensionBugClosedBlockerFixedRaghu AngadiRaghu AngadiRaghu Angadi12/31/08 20:407/8/09 17:431/8/09 1:037.2
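A hedged sketch of the re-randomization idea from the report (invented class, not the actual DataNode code): if the service loop wakes up far past its scheduled report time, pick a fresh random offset rather than reporting immediately like every other stalled node.
{code}
import java.util.Random;

class BlockReportSchedulerSketch {
  private final Random rng = new Random();
  private final long intervalMs;
  private long nextReportTime;

  BlockReportSchedulerSketch(long intervalMs) {
    this.intervalMs = intervalMs;
    this.nextReportTime = System.currentTimeMillis()
        + (long) (rng.nextDouble() * intervalMs);      // initial random spread
  }

  /** Called from the service loop; returns true if a report should be sent now. */
  boolean shouldReport(long now, long stallThresholdMs) {
    if (now - nextReportTime > stallThresholdMs) {
      // Woke up far too late (the thread was stuck): re-randomize instead of
      // reporting at the same instant as every other delayed datanode.
      nextReportTime = now + (long) (rng.nextDouble() * intervalMs);
      return false;
    }
    if (now >= nextReportTime) {
      nextReportTime = now + intervalMs;
      return true;
    }
    return false;
  }
}
{code}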
87
Hadoop CommonHADOOP-4924Race condition in re-init of TaskTrackerThe taskReportServer is stopped in the TaskTracker.close() method in a thread. The race condition is:
1) TaskTracker.close() is invoked - this starts a thread to stop the taskReportServer
2) TaskTracker.initialize is invoked - this tries to create a new taskReportServer
Assume that the thread started to stop the taskReportServer gets to start its work after (2) above. The thread will end up stopping the newly created taskReportServer.
Order violationBugClosedBlockerFixedDevaraj DasDevaraj DasDevaraj Das12/21/08 15:371/30/09 20:1412/22/08 13:130.9
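One way to remove the race described above is to make the asynchronous stopper operate on a captured reference, so it can never stop a server created by a later initialize(). A rough sketch with invented names:
{code}
class ReportServerOwnerSketch {
  private volatile RpcServerLike taskReportServer;

  void close() {
    final RpcServerLike doomed = taskReportServer;   // snapshot the instance to stop
    taskReportServer = null;
    if (doomed != null) {
      new Thread(new Runnable() {
        public void run() { doomed.stop(); }         // only ever touches the old server
      }, "report-server-stopper").start();
    }
  }

  synchronized void initialize() {
    taskReportServer = new RpcServerLike();          // safe even if close()'s thread is still running
  }

  static class RpcServerLike {
    void stop() { }
  }
}
{code}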
88
Hadoop CommonHADOOP-4904Deadlock while leaving safe mode.{{SafeModeInfo.leave()}} acquires locks in an incorrect order, which causes the deadlock.
It first acquires the {{SafeModeInfo}} lock, then calls {{FSNamesystem.processMisReplicatedBlocks()}}, which requires the global {{FSNamesystem}} lock.
It should be the other way around: first {{FSNamesystem}} lock, then {{SafeModeInfo}}.
DeadlockBugClosedBlockerFixedKonstantin ShvachkoKonstantin ShvachkoKonstantin Shvachko12/17/08 17:207/8/09 17:4312/19/08 2:381.4
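The fix described above is a lock-ordering rule: every path must take the broader FSNamesystem lock before the narrower SafeModeInfo lock. A schematic sketch (class names mirror the report; bodies are placeholders):
{code}
class FSNamesystemSketch {
  private final Object safeModeInfo = new Object();

  // Correct order: the namesystem monitor first, then the safe-mode state.
  synchronized void leaveSafeMode() {
    processMisReplicatedBlocks();          // runs under the FSNamesystem lock we already hold
    synchronized (safeModeInfo) {
      // clear safe-mode flags here
    }
  }

  private void processMisReplicatedBlocks() {
    // scan blocks under the namesystem lock
  }
}
{code}
With a single agreed lock order, the circular wait that defines a deadlock cannot form.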
89
Hadoop CommonHADOOP-4840TestNodeCount sometimes fails with NullPointerExceptionTestcase: testNodeCount took 9.628 sec
Caused an ERROR

java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.countNodes(FSNamesystem.java:3523)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.countNodes(FSNamesystem.java:3543)
at org.apache.hadoop.hdfs.server.namenode.TestNodeCount.testNodeCount(TestNodeCount.java:64)
Data raceBugClosedMajorFixedHairong KuangHairong KuangHairong Kuang12/11/08 18:517/8/09 17:4312/19/08 1:137.3
90
Hadoop CommonHADOOP-4744Wrong resolution of hostname and portI noticed the following for one of the hosts in a cluster:

1. The machines.jsp page resolves the HTTP address as just "http://hostname" (which doesn't work). It doesn't include the port number for the host. Even if I add the port number manually to the URI, the task tracker page does not come up.
2. All the tasks (both maps and reduces) which ran on the machine ran successfully, but task logs cannot be viewed because the port number is not resolved (same problem as in (1)).
3. The reducers waiting for maps that ran on that machine fail with connection-failed errors saying the hostname is 'null'.

Data raceBugClosedBlockerFixedJothi PadmanabhanAmareshwari SriramadasuAmareshwari Sriramadasu12/1/08 10:2811/14/09 2:123/2/09 10:3791.0
91
Hadoop CommonHADOOP-4679Datanode prints tons of log messages: Waiting for threadgroup to exit, active threads is XXWhen a data receiver thread sees a disk error, it immediately calls shutdown to shut down the DataNode. But the shutdown method does not return before all data receiver threads exit, which will never happen. Therefore the DataNode gets into a deadlock/livelock state, emitting tons of log messages: Waiting for threadgroup to exit, active threads is XX.LivelockBugClosedMajorFixedHairong KuangHairong KuangHairong Kuang11/18/08 19:277/8/09 17:4312/3/08 20:0715.0
92
Hadoop CommonHADOOP-4614"Too many open files" error while processing a large gzip fileI am running a simple word count program on gzip-compressed data of size 4 GB (uncompressed size is about 7 GB). I have a setup of 17 nodes in my Hadoop cluster. After some time, I get the following exception:

java.io.FileNotFoundException: /usr/local/hadoop/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_200811041109_0003/attempt_200811041109_0003_m_000000_0/output/spill4055.out.index
(Too many open files)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:137)
at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.(RawLocalFileSystem.java:62)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.(RawLocalFileSystem.java:98)
at org.apache.hadoop.fs.RawLocalFileSystem.open(RawLocalFileSystem.java:168)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:359)
at org.apache.hadoop.mapred.IndexRecord.readIndexFile(IndexRecord.java:47)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getIndexInformation(MapTask.java:1339)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1237)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:857)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:333)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

From a user's perspective I know that Hadoop will use only one mapper for a gzipped file. The above exception suggests that Hadoop probably puts the intermediate data into many files. But the question is: exactly how many open files are required in the worst case for any data size and cluster size? Currently it looks as if Hadoop needs more open files as the input size or the cluster size (in terms of nodes, mappers, reducers) increases. This is not tenable as far as scalability is concerned. A user has to put a number in the /etc/security/limits.conf file specifying how many open files a Hadoop node is allowed, and the question is what that "magical number" should be.

So probably the best solution to this problem is to change Hadoop in such a way that it can work with a moderate number of allowed open files (e.g. 4K), or some other number should be documented as an upper limit, so that a user can be sure that for any data size and cluster size, Hadoop will not run into this "too many open files" issue.
StarvationBugClosedBlockerFixedYuri PradkinAbdul QadeerAbdul Qadeer11/7/08 8:566/29/15 14:207/8/09 17:5311/26/08 1:4218.7
93
Hadoop CommonHADOOP-4595JVM Reuse triggers RuntimeException("Invalid state")A Reducer triggers the following exception:

08/11/05 08:58:50 INFO mapred.JobClient: Task Id : attempt_200811040110_0230_r_000008_1, Status : FAILED
java.lang.RuntimeException: Inconsistent state!!! JVM Manager reached an unstable state while reaping a JVM for task: attempt_200811040110_0230_r_000008_1 Number of active JVMs:2
JVMId jvm_200811040110_0230_r_-735233075 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000012_0
JVMId jvm_200811040110_0230_r_-1716942642 #Tasks ran: 0 Currently busy? true Currently running: attempt_200811040110_0230_r_000040_0
at java.lang.Throwable.<init>(Throwable.java:67)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.reapJvm(JvmManager.java:245)
at org.apache.hadoop.mapred.JvmManager$JvmManagerForType.access$000(JvmManager.java:113)
at org.apache.hadoop.mapred.JvmManager.launchJvm(JvmManager.java:78)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:410)

Other clues:

In the three reduce task attempts where this was observed, it was attempt _1. Attempt _0 had started and eventually switched to "SUCCEEDED", so I think this is happening only on speculatively executed reduce task attempts. The reduce output (part-XXXXX) gets lost when this attempt fails, even though the other (earlier) attempt succeeded.
Data raceBugClosedMajorFixedDevaraj DasAaron KimballAaron Kimball11/5/08 19:237/8/09 17:5311/11/08 4:505.4
94
Hadoop CommonHADOOP-4556Block went missingSuspicion that all replicas of a block were marked for deletion. (Don't panic, investigation underway.)Order violationBugClosedMajorFixedHairong KuangRobert ChanslerRobert Chansler10/31/08 0:437/8/09 17:4311/11/08 0:1611.0
95
Hadoop CommonHADOOP-4552Deadlock in RPC ServerThe RPC server could get into a deadlock, especially when clients or the server are network-starved. The deadlock is between the RPC responder thread trying to check whether there are any connections to be purged and an RPC handler trying to queue a response to be written by the responder.

This was first observed in [this thread|http://www.nabble.com/TaskTrackers-disengaging-from-JobTracker-to20234317.html].
DeadlockBugClosedMajorFixedRaghu AngadiRaghu AngadiRaghu Angadi10/30/08 19:1411/20/08 23:3811/11/08 19:3412.0
96
Hadoop CommonHADOOP-4533HDFS client of hadoop 0.18.1 and HDFS server 0.18.2 (0.18 branch) not compatibleNot sure whether this is considered a bug or an expected case.
But here are the details.

I have a cluster using a build from the Hadoop 0.18 branch.
When I tried to use the Hadoop 0.18.1 DFS client to load files to it, I got the following exceptions:

hadoop --config ~/test dfs -copyFromLocal gridmix-env /tmp/.
08/10/28 16:23:00 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
08/10/28 16:23:00 INFO dfs.DFSClient: Abandoning block blk_-439926292663595928_1002
08/10/28 16:23:06 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
08/10/28 16:23:06 INFO dfs.DFSClient: Abandoning block blk_5160335053668168134_1002
08/10/28 16:23:12 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
08/10/28 16:23:12 INFO dfs.DFSClient: Abandoning block blk_4168253465442802441_1002
08/10/28 16:23:18 INFO dfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Could not read from stream
08/10/28 16:23:18 INFO dfs.DFSClient: Abandoning block blk_-2631672044886706846_1002
08/10/28 16:23:24 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

08/10/28 16:23:24 WARN dfs.DFSClient: Error Recovery for block blk_-2631672044886706846_1002 bad datanode[0]
copyFromLocal: Could not get block locations. Aborting...
Exception closing file /tmp/gridmix-env
java.io.IOException: Could not get block locations. Aborting...
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)

This problem has a severe impact on Pig 2.0, since it is pre-packaged with Hadoop 0.18.1 and will use
the Hadoop 0.18.1 DFS client in its interaction with the Hadoop cluster.
That means that Pig 2.0 will not work with the soon-to-be-released Hadoop 0.18.2.


Not ClearBugClosedBlockerFixedHairong KuangRunping QiRunping Qi10/28/08 16:367/8/09 17:4310/29/08 21:521.2
97
Hadoop CommonHADOOP-4517unstable dfs when running jobs on 0.18.12 attempts of a job using 6000 maps, 1900 reduces

1st attempt: failed during the reduce phase after 22 hours with 31 dead datanodes, most of which became unresponsive due to an exception; DFS lost blocks.
2nd attempt: failed during the map phase after 5 hours with 5 dead datanodes due to an exception; DFS lost blocks, which was responsible for the job failure.

I will post a typical datanode exception and attach a thread dump.
DeadlockBugClosedBlockerFixedTsz Wo Nicholas SzeChristian KunzChristian Kunz10/24/08 20:167/8/09 17:4310/28/08 0:073.2
98
Hadoop CommonHADOOP-4513Capacity scheduler should initialize tasks asynchronouslyCurrently, the capacity scheduler initializes tasks on demand, as opposed to the eager initialization technique used by the default scheduler. This is done to reduce the JT memory footprint. However, the initialization is done in the {{assignTasks}} API, which is not a good idea as task initialization can be a time-consuming operation. This JIRA is to move the initialization out of the {{assignTasks}} API and do it asynchronously.SuspensionBugClosedMajorFixedSreekanth RamakrishnanHemanth YamijalaHemanth Yamijala10/24/08 6:537/8/09 17:4011/28/08 10:3235.2
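A hedged sketch of the asynchronous-initialization idea: the scheduler only hands the job to a background executor, so the time-consuming work never runs inside the {{assignTasks}} path or under the JT lock. Names are illustrative, not the actual capacity scheduler code:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class AsyncInitSketch {
  private final ExecutorService initPool = Executors.newSingleThreadExecutor();

  void jobAdded(final JobLike job) {
    initPool.submit(new Runnable() {
      public void run() {
        job.initTasks();      // heavy, possibly blocking work runs off the scheduling path
      }
    });
  }

  void shutdown() {
    initPool.shutdown();
  }

  interface JobLike {
    void initTasks();
  }
}
{code}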
99
Hadoop CommonHADOOP-4445Wrong number of running map/reduce tasks are displayed in queue information.The wrong number of running map/reduce tasks is displayed in the queue information.Data raceBugClosedMajorFixedSreekanth RamakrishnanKaram SinghKaram Singh10/17/08 14:067/8/09 17:4012/5/08 16:5349.1
100
Hadoop CommonHADOOP-4399fuse-dfs per FD context is not thread safe and can cause segfaults and corruptionsFor reads, the optimal solution would be to have a per-thread (per-FD) context, including the buffer. Otherwise, protect the FD context with a mutex as in HADOOP-4397 and HADOOP-4398.

For writes, we should just protect the context with a mutex as in HADOOP-4398.

Data raceBugClosedBlockerFixedPete WyckoffPete WyckoffPete Wyckoff10/13/08 1:567/8/09 18:0510/20/08 6:127.2
Concurrency Bugs