AM jobs going out of memory on a large cluster.
Blocker.
Yes. Eventually “UndeclaredThrowableException”.
No.
This job.
0.23.0
Standard
Just run many long MR jobs
None.
Yes.
Yes.
1.
AM + NM
2011-11-02 11:40:36,438 ERROR [ContainerLauncher #258] org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl: Container launch failed for container_1320233407485_0002_01_001434 : java.lang.reflect.UndeclaredThrowableException
--- This is the out-of-memory / file-descriptor failure:
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:88)
at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:290)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:139)
at $Proxy20.startContainer(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagerPBClientImpl.startContainer(ContainerManagerPBClientImpl.java:81)
... 4 more
Caused by: java.io.IOException: Failed on local exception: java.io.IOException: Couldn't set up IO streams; Host Details : local host is: "gsbl91281.blue.ygrid.yahoo.com/98.137.101.189"; destination host is: ""gsbl91525.blue.ygrid.yahoo.com":45450;
at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:655)
at org.apache.hadoop.ipc.Client.call(Client.java:1089)
at org.apache.hadoop.yarn.ipc.ProtoOverHadoopRpcEngine$Invoker.invoke(ProtoOverHadoopRpcEngine.java:136)
... 6 more
Caused by: java.io.IOException: Couldn't set up IO streams
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:621)
at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:205)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1195)
at org.apache.hadoop.ipc.Client.call(Client.java:1065)
... 7 more
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:597)
at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:614)
... 10 more
From the exception, you can see that it failed while trying to establish IPC connections. From the AM's log, we can see:
2013-07-26 11:01:02,609 INFO [ContainerLauncher #0] org.apache.hadoop.yarn.ipc.YarnRPC: Creating YarnRPC for org.apache.hadoop.yarn.ipc.HadoopYarnProtoRPC
occurring a lot.
What happened was that the AM created one connection per container to a NodeManager, and this per-container connection was not closed after use. Hadoop RPC creates one thread per connection, so the number of threads soon reached the ulimit on the node's number of processes, which Java reports, somewhat misleadingly, as an out-of-memory error ("unable to create new native thread").
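To make the failure mode concrete, here is a minimal, self-contained Java sketch of the same pattern. It is illustrative only: the class and thread names are hypothetical stand-ins for Hadoop's IPC client, not the actual ContainerLauncherImpl code, and running it will deliberately exhaust the JVM's native thread limit.

// Illustrative only: mimics the leak with plain threads instead of Hadoop's
// IPC client. Each "connection" owns a reader thread that is never stopped,
// so one thread is leaked per container launch until Thread.start() fails
// with java.lang.OutOfMemoryError: unable to create new native thread.
import java.util.concurrent.CountDownLatch;

public class PerContainerConnectionLeak {

    // Stand-in for one IPC connection: a dedicated reader thread that never
    // exits because close() is never called, mirroring the original bug.
    static final class LeakedConnection {
        private final CountDownLatch neverClosed = new CountDownLatch(1);

        LeakedConnection(int containerId) {
            Thread reader = new Thread(() -> {
                try {
                    neverClosed.await();  // would only return if the connection were ever closed
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "ipc-reader-for-container-" + containerId);
            reader.setDaemon(true);
            reader.start();               // this is where Thread.start0 eventually fails
        }
    }

    public static void main(String[] args) {
        int launched = 0;
        try {
            while (true) {
                launched++;
                new LeakedConnection(launched);  // one leaked thread per "container launch"
            }
        } catch (OutOfMemoryError e) {
            System.err.println("Failed after " + launched + " leaked connections: " + e);
        }
    }
}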
Semantic (resource leak)
The fix is to set IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY (ipc.client.connection.maxidletime) to 0, basically tearing down the IPC connection immediately after it is used.
The reason the developers did NOT explicitly close the RPC connection is that the underlying layer is messed up: the developer initially tried calling close, but found it did not actually close the connection. In other words, this fix (setting the idle time for connections to zero) is a workaround (or hack).
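For reference, here is a hedged sketch of what the workaround amounts to in code. Only the configuration key (IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, i.e. ipc.client.connection.maxidletime) and the value 0 come from the fix description; the wrapper class and method below are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CommonConfigurationKeysPublic;

public class ZeroIdleTimeConf {

    // Hypothetical helper: copy the given Configuration and force the IPC
    // client's max idle time to 0, so each connection is torn down as soon
    // as it goes idle instead of a thread lingering per launched container.
    public static Configuration withZeroIdleTime(Configuration base) {
        Configuration conf = new Configuration(base);
        // IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY == "ipc.client.connection.maxidletime"
        conf.setInt(
                CommonConfigurationKeysPublic.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
        return conf;
    }
}

Whatever Configuration the AM hands to the RPC layer when creating its per-container proxies would need this setting for the workaround to take effect.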