[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
  [Search] Search   [Recent Topics] Recent Topics   [Members]  Member Listing   [Groups] Back to home page 
[Register] Register / 
[Login] Login 
[Expert]
3.5.0 terracotta server deadlock  XML
Forum Index -> Terracotta Platform
Author Message
florescu

journeyman

Joined: 03/07/2008 15:58:51
Messages: 36
Offline

We recently experienced long kernel pauses (linux versions bellow 2.6.38) under severe memory pressure and L1s were dropped. When we restarted them, we noticed that terracotta server was not responsive, and when forcing a thread dump, we found a clear deadlock. The server had default logging level, and the log reveals nothing for the hours prior to L1s being dropped.

I have bellow the 2 deadlocked threads:

Java stack information for the threads listed above:
===================================================
"OOO Connection Restore Timer":
at com.tc.net.protocol.transport.MessageTransportBase.isConnected(MessageTransportBase.java:191)
- waiting to lock <0x00000000e0bdbb70> (a com.tc.net.protocol.transport.MessageTransportStatus)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.sendMessage(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:364)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.close(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:278)
at com.tc.net.protocol.tcm.AbstractMessageChannel.close(AbstractMessageChannel.java:135)
at com.tc.net.protocol.tcm.ServerMessageChannelImpl.close(ServerMessageChannelImpl.java:18)
at com.tc.net.protocol.tcm.ChannelManagerImpl.notifyChannelEvent(ChannelManagerImpl.java:103)
at com.tc.net.protocol.tcm.AbstractMessageChannel.fireEvent(AbstractMessageChannel.java:243)
at com.tc.net.protocol.tcm.AbstractMessageChannel.fireTransportDisconnectedEvent(AbstractMessageChannel.java:203)
at com.tc.net.protocol.tcm.AbstractMessageChannel.notifyTransportDisconnected(AbstractMessageChannel.java:199)
at com.tc.net.protocol.tcm.ServerMessageChannelImpl.notifyTransportDisconnected(ServerMessageChannelImpl.java:18)
at com.tc.net.protocol.delivery.OnceAndOnlyOnceProtocolNetworkLayerImpl.connectionRestoreFailed(OnceAndOnlyOnceProtocolNetworkLayerImpl.java:445)
at com.tc.net.protocol.delivery.OOOReconnectionTimeout.restoreConnectionFailed(OOOReconnectionTimeout.java:73)
- locked <0x00000000e0bdb1d8> (a com.tc.net.protocol.delivery.OOOReconnectionTimeout)
at com.tc.net.protocol.delivery.OOOReconnectionTimeout$TimeoutTimerTask.run(OOOReconnectionTimeout.java:90)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
"L2_L1:TCWorkerComm # 3_R":
at com.tc.net.protocol.delivery.OOOReconnectionTimeout.notifyTransportConnected(OOOReconnectionTimeout.java:54)
- waiting to lock <0x00000000e0bdb1d8> (a com.tc.net.protocol.delivery.OOOReconnectionTimeout)
at com.tc.net.protocol.transport.AbstractMessageTransport.fireTransportEvent(AbstractMessageTransport.java:105)
at com.tc.net.protocol.transport.AbstractMessageTransport.fireTransportConnectedEvent(AbstractMessageTransport.java:69)
at com.tc.net.protocol.transport.ServerMessageTransport.handleAck(ServerMessageTransport.java:88)
at com.tc.net.protocol.transport.ServerMessageTransport.verifyAndHandleAck(ServerMessageTransport.java:76)
at com.tc.net.protocol.transport.ServerMessageTransport.receiveTransportMessageImpl(ServerMessageTransport.java:50)
- locked <0x00000000e0bdbb70> (a com.tc.net.protocol.transport.MessageTransportStatus)
at com.tc.net.protocol.transport.MessageTransportBase.receiveTransportMessage(MessageTransportBase.java:91)
- locked <0x00000000e0bdbbb0> (a java.lang.Object)
at com.tc.net.protocol.transport.ServerStackProvider$MessageSink.putMessage(ServerStackProvider.java:238)
at com.tc.net.protocol.transport.WireProtocolAdaptorImpl.addReadData(WireProtocolAdaptorImpl.java:54)
at com.tc.net.protocol.ProtocolSwitch.addReadData(ProtocolSwitch.java:50)
at com.tc.net.core.TCConnectionImpl.addNetworkData(TCConnectionImpl.java:687)
at com.tc.net.core.TCConnectionImpl.doReadInternal(TCConnectionImpl.java:365)
at com.tc.net.core.TCConnectionImpl.doRead(TCConnectionImpl.java:227)
at com.tc.net.core.CoreNIOServices$CommThread.selectLoop(CoreNIOServices.java:625)
at com.tc.net.core.CoreNIOServices$CommThread.run(CoreNIOServices.java:294)
ari

seraphim

Joined: 05/24/2006 14:23:21
Messages: 1665
Location: San Francisco, CA
Offline

3.5.0 is not the latest minor on the 3.5-line, right? Can you not apply all the updates at least? To 3.5.4, etc?

http://docs.terracotta.org/confluence/display/release/3.5.4+Downloads
[WWW]
florescu

journeyman

Joined: 03/07/2008 15:58:51
Messages: 36
Offline

Considering that we have not yet managed to reproduce this deadlock, and that we cannot yet upgrade, I am interested to find out if higher versions (3.5.x or 3.6.x) have fixes in this area that could potentially have fixed this. If I cannot easily reproduce, how can I prove that 3.5.4 fixes it? Thanks for your help
ari

seraphim

Joined: 05/24/2006 14:23:21
Messages: 1665
Location: San Francisco, CA
Offline

I went through release notes:
http://docs.terracotta.org/confluence/display/release/Terracotta+3.5+Release+Notes

And don't see any repaired deadlocks in the comms layer.

What I do see on closer examination of your stack trace is that it seems you might have been disconnected from the cluster, not deadlocked.

GC pause? Healthchecker kicked in?

I would look for a smoking gun log entry around long GC being detected and disconnect / reconnect activity right before this assumed-deadlock.

--Ari
[WWW]
twu

journeyman

Joined: 10/18/2011 12:48:09
Messages: 21
Offline

This is a known deadlock issue with us firing notifications under too broad a lock scope. It's been fixed in 3.6.2.
florescu

journeyman

Joined: 03/07/2008 15:58:51
Messages: 36
Offline

Thanks twu! Do you have a bug tracing number for the fix?
twu

journeyman

Joined: 10/18/2011 12:48:09
Messages: 21
Offline

Yes, it's DEV-7123. However it appears to have been missed from the release notes for 3.6.2. I've double checked our source repository and can confirm that the fix is indeed in the 3.6.2 release.
 
Forum Index -> Terracotta Platform
Go to:   
Powered by JForum 2.1.7 © JForum Team