florescu | journeyman | 05/29/2012 16:18:05
Joined: 03/07/2008 15:58:51 | Messages: 36 | Offline
We recently experienced long kernel pauses (on Linux versions below 2.6.38) under severe memory pressure, and our L1 clients were dropped. When we restarted them, we noticed that the Terracotta server was not responsive, and when we forced a thread dump we found a clear deadlock. The server was running at the default logging level, and the log reveals nothing for the hours prior to the L1s being dropped.
The two deadlocked threads are below:
Java stack information for the threads listed above:
===================================================
"OOO Connection Restore Timer":
at com.tc.net.protocol.transport.MessageTransportBase.isConnected(MessageTransportBase.java:191)
- waiting to lock <0x00000000e0bdbb70> (a com.tc.net.protocol.transport.MessageTransportStatus)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.sendMessage(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:364)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.close(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:278)
at com.tc.net.protocol.tcm.AbstractMessageChannel.close(AbstractMessageChannel.java:135)
at com.tc.net.protocol.tcm.ServerMessageChannelImpl.close(ServerMessageChannelImpl.java:18)
at com.tc.net.protocol.tcm.ChannelManagerImpl.notifyChannelEvent(ChannelManagerImpl.java:103)
at com.tc.net.protocol.tcm.AbstractMessageChannel.fireEvent(AbstractMessageChannel.java:243)
at com.tc.net.protocol.tcm.AbstractMessageChannel.fireTransportDisconnectedEvent(AbstractMessageChannel.java:203)
at com.tc.net.protocol.tcm.AbstractMessageChannel.notifyTransportDisconnected(AbstractMessageChannel.java:199)
at com.tc.net.protocol.tcm.ServerMessageChannelImpl.notifyTransportDisconnected(ServerMessageChannelImpl.java:18)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.connectionRestoreFailed(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:445)
at com.tc.net.protocol.delivery.OOOReconnectionTimeout.restoreConnectionFailed(OOOReconnectionTimeout.java:73)
- locked <0x00000000e0bdb1d8> (a com.tc.net.protocol.delivery.OOOReconnectionTimeout)
at com.tc.net.protocol.delivery.OOOReconnectionTimeout$TimeoutTimerTask.run(OOOReconnectionTimeout.java:90)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
"L2_L1:TCWorkerComm # 3_R":
at com.tc.net.protocol.delivery.OOOReconnectionTimeout.notifyTransportConnected(OOOReconnectionTimeout.java:54)
- waiting to lock <0x00000000e0bdb1d8> (a com.tc.net.protocol.delivery.OOOReconnectionTimeout)
at com.tc.net.protocol.transport.AbstractMessageTransport.fireTransportEvent(AbstractMessageTransport.java:105)
at com.tc.net.protocol.transport.AbstractMessageTransport.fireTransportConnectedEvent(AbstractMessageTransport.java:69)
at com.tc.net.protocol.transport.ServerMessageTransport.handleAck(ServerMessageTransport.java:88)
at com.tc.net.protocol.transport.ServerMessageTransport.verifyAndHandleAck(ServerMessageTransport.java:76)
at com.tc.net.protocol.transport.ServerMessageTransport.receiveTransportMessageImpl(ServerMessageTransport.java:50)
- locked <0x00000000e0bdbb70> (a com.tc.net.protocol.transport.MessageTransportStatus)
at com.tc.net.protocol.transport.MessageTransportBase.receiveTransportMessage(MessageTransportBase.java:91)
- locked <0x00000000e0bdbbb0> (a java.lang.Object)
at com.tc.net.protocol.transport.ServerStackProvider$MessageSink.putMessage(ServerStackProvider.java:238)
at com.tc.net.protocol.transport.WireProtocolAdaptorImpl.addReadData(WireProtocolAdaptorImpl.java:54)
at com.tc.net.protocol.ProtocolSwitch.addReadData(ProtocolSwitch.java:50)
at com.tc.net.core.TCConnectionImpl.addNetworkData(TCConnectionImpl.java:687)
at com.tc.net.core.TCConnectionImpl.doReadInternal(TCConnectionImpl.java:365)
at com.tc.net.core.TCConnectionImpl.doRead(TCConnectionImpl.java:227)
at com.tc.net.core.CoreNIOServices$CommThread.selectLoop(CoreNIOServices.java:625)
at com.tc.net.core.CoreNIOServices$CommThread.run(CoreNIOServices.java:294)
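
To make the cycle concrete: the timer thread holds the OOOReconnectionTimeout monitor (0x00000000e0bdb1d8) and is waiting for the MessageTransportStatus monitor (0x00000000e0bdbb70), while the worker thread holds them in the opposite order. The sketch below is hypothetical (plain Objects stand in for those two monitors; none of it is Terracotta code), but it reproduces the same inversion and then confirms it programmatically, much as the forced thread dump did:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical reproduction of the inversion above; plain Objects stand in
// for the OOOReconnectionTimeout and MessageTransportStatus monitors.
public class LockInversionDemo {
    private static final Object RECONNECT_TIMEOUT = new Object();
    private static final Object TRANSPORT_STATUS = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread timer = new Thread(() -> {
            synchronized (RECONNECT_TIMEOUT) {      // like restoreConnectionFailed()
                sleepQuietly(100);
                synchronized (TRANSPORT_STATUS) {   // like isConnected(): never acquired
                    System.out.println("timer thread got both locks");
                }
            }
        }, "OOO Connection Restore Timer");

        Thread comm = new Thread(() -> {
            synchronized (TRANSPORT_STATUS) {       // like receiveTransportMessageImpl()
                sleepQuietly(100);
                synchronized (RECONNECT_TIMEOUT) {  // like notifyTransportConnected(): never acquired
                    System.out.println("comm thread got both locks");
                }
            }
        }, "L2_L1:TCWorkerComm # 3_R");

        timer.start();
        comm.start();

        // Give both threads time to take their first monitor, then detect the
        // cycle the same way a thread dump does. Note the demo never exits,
        // since the two deadlocked threads can never finish.
        Thread.sleep(500);
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads();
        System.out.println(ids == null ? "no deadlock" : ids.length + " threads deadlocked");
    }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException ignored) { }
    }
}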

ari | seraphim | 05/29/2012 21:18:02
Joined: 05/24/2006 14:23:21 | Messages: 1665 | Location: San Francisco, CA | Offline
3.5.0 is not the latest minor release on the 3.5 line, right? Can you at least apply all the updates and move to 3.5.4?
http://docs.terracotta.org/confluence/display/release/3.5.4+Downloads

florescu | journeyman | 05/30/2012 11:20:19
Joined: 03/07/2008 15:58:51 | Messages: 36 | Offline
Considering that we have not yet managed to reproduce this deadlock, and that we cannot upgrade yet, I am interested in finding out whether later versions (3.5.x or 3.6.x) contain fixes in this area that could have addressed it. If I cannot easily reproduce the problem, how can I prove that 3.5.4 fixes it? Thanks for your help.

ari | seraphim | 05/30/2012 16:40:26
Joined: 05/24/2006 14:23:21 | Messages: 1665 | Location: San Francisco, CA | Offline
I went through the release notes:
http://docs.terracotta.org/confluence/display/release/Terracotta+3.5+Release+Notes
and I don't see any deadlock fixes in the comms layer.
What I do see, on closer examination of your stack trace, is that you might have been disconnected from the cluster rather than deadlocked.
A GC pause? Did the HealthChecker kick in?
I would look for a smoking-gun log entry about a long GC being detected, and for disconnect/reconnect activity, right before this assumed deadlock.
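
If you want to catch that yourself going forward, a rough watchdog along these lines would flag long collections as they happen (the class name and threshold are mine for illustration, not anything Terracotta ships):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

// Illustrative watchdog: polls the standard GC MXBeans once a second and logs
// whenever the cumulative collection time jumps by more than a threshold
// between samples -- roughly the "long GC detected" signal to look for.
public class GcPauseWatchdog {
    public static void main(String[] args) throws InterruptedException {
        final long thresholdMillis = 1000; // assumed threshold; tune to taste
        Map<String, Long> lastTotals = new HashMap<>();
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long total = gc.getCollectionTime(); // cumulative millis spent in this collector
                long delta = total - lastTotals.getOrDefault(gc.getName(), 0L);
                if (delta > thresholdMillis) {
                    System.err.println("long GC: " + gc.getName()
                            + " spent " + delta + " ms collecting since the last sample");
                }
                lastTotals.put(gc.getName(), total);
            }
            Thread.sleep(1000);
        }
    }
}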
--Ari

twu | journeyman | 05/30/2012 17:01:05
Joined: 10/18/2011 12:48:09 | Messages: 21 | Offline
This is a known deadlock issue: we were firing notifications under too broad a lock scope. It has been fixed in 3.6.2.
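
For anyone curious, the general shape of this kind of fix (the sketch below only illustrates the pattern, it is not the actual change) is to stop invoking listener callbacks while a monitor is held, for example by iterating a snapshot or a concurrent collection:

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative pattern only, not Terracotta source: callbacks run with no
// locks held, so a listener that takes another monitor cannot close a cycle.
class TransportEvents {
    private final List<Runnable> listeners = new CopyOnWriteArrayList<>();

    void addListener(Runnable listener) {
        listeners.add(listener);
    }

    void fireTransportConnected() {
        // CopyOnWriteArrayList iterates over an immutable snapshot, so no
        // synchronization is needed here and none is held during the callback.
        for (Runnable listener : listeners) {
            listener.run();
        }
    }
}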

florescu | journeyman | 05/31/2012 09:40:34
Joined: 03/07/2008 15:58:51 | Messages: 36 | Offline
Thanks twu! Do you have a bug tracking number for the fix?

twu | journeyman | 05/31/2012 11:41:02
Joined: 10/18/2011 12:48:09 | Messages: 21 | Offline
Yes, it's DEV-7123. However, it appears to have been omitted from the release notes for 3.6.2. I've double-checked our source repository and can confirm that the fix is indeed in the 3.6.2 release.