Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)

Sun JVM, 1.6. Terracotta 3.7. 28GB Heap, startup options:

/usr/java/latest/bin/java -server -Xms28672m -Xmx28672m -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCTimeStamps -XX:+PrintGC -XX:+UseConcMarkSweepGC

Date=Fri Jan 04 19:54:51 GMT 2013, startMem=24516827, endMem=24405072, reclaimed=111755, timeTaken=0.234951 (1172271.092: [GC 24516827K->24405072K(29343104K), 0.2349510 secs])
Date=Fri Jan 04 19:54:52 GMT 2013, startMem=24541392, endMem=24428143, reclaimed=113249, timeTaken=0.217251 (1172272.112: [GC 24541392K->24428143K(29343104K), 0.2172510 secs])
Date=Fri Jan 04 19:54:53 GMT 2013, startMem=24564463, endMem=24453564, reclaimed=110899, timeTaken=0.197431 (1172273.156: [GC 24564463K->24453564K(29343104K), 0.1974310 secs])
Date=Fri Jan 04 19:54:54 GMT 2013, startMem=24589884, endMem=24478495, reclaimed=111389, timeTaken=0.211244 (1172274.139: [GC 24589884K->24478495K(29343104K), 0.2112440 secs])
Date=Fri Jan 04 19:54:56 GMT 2013, startMem=24614815, endMem=24505482, reclaimed=109333, timeTaken=61.301987 (1172276.518: [GC 24614815K->24505482K(29343104K), 61.3019860 secs])
Date=Fri Jan 04 19:56:08 GMT 2013, startMem=24641802, endMem=24529546, reclaimed=112256, timeTaken=2.68437 (1172348.921: [GC 24641802K->24529546K(29343104K), 2.6843700 secs])
Date=Fri Jan 04 19:56:14 GMT 2013, startMem=24665866, endMem=24552912, reclaimed=112954, timeTaken=1.391239 (1172354.608: [GC 24665866K->24552912K(29343104K), 1.3912390 secs])
Date=Fri Jan 04 19:56:18 GMT 2013, startMem=24689232, endMem=24583021, reclaimed=106211, timeTaken=1.607887 (1172358.405: [GC 24689232K->24583021K(29343104K), 1.6078870 secs])
Date=Fri Jan 04 19:56:22 GMT 2013, startMem=24719341, endMem=24613800, reclaimed=105541, timeTaken=1.015518 (1172362.232: [GC 24719341K->24613800K(29343104K), 1.0155180 secs])
Date=Fri Jan 04 19:56:24 GMT 2013, startMem=24750120, endMem=24644579, reclaimed=105541, timeTaken=0.958144 (1172364.844: [GC 24750120K->24644579K(29343104K), 0.9581440 secs])
Date=Fri Jan 04 19:56:32 GMT 2013, startMem=24780899, endMem=24665640, reclaimed=115259, timeTaken=0.15122 (1172372.355: [GC 24780899K->24665640K(29343104K), 0.1512200 secs])

Short of adding more verbose GC log options, as seen below, are there any other recommendations for finding the root cause for this? It seems only 100MB is free'd during those 60 seconds.

'-XX:+PrintGCDateStamps' Print date stamps instead of relative timestamps
'-XX:+PrintGCDetails' Will show what cpu went for (user, kern), gc algorithm used
'-XX:+PrintHeapAtGC' will show all of the heaps memory containers and their usage
'-Xloggc:/path/to/dedicated.log' log to specific file

These look fine but apart from this can you use following jvm parameters which are recommended :

-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:NewRatio=3 -XX:+UseCompressedOops

"UseParNewGC" will be added as suggested, but as all clients dropped of I'm pretty confident that this is a full GC and not a minor (even though the log reads "GC" and not "Full GC" - which I think is because it's PrintGC and not PrintGCDetails). The numbers indicates that only ~100MB of tenured was collected on those 50 seconds, so increasing the NewRatio to =3 would steal a significant amount of memory from the tenured space.

If I am correct in all above assumptions, major GC's will run more often due to smaller tenured space. Is this what you think will be beneficial, or did you just recommend whatever settings most use? We could double the heap, but I'm not confident that would improve anything.

Oh, very large heap and temporary-swap-only. EHcache recommends Concurrent Mark Sweep for large heaps (+6GB), and not Parallel. Any feedback on that?

Terracotta client log:
2013-01-04 19:55:34,247 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 1
2013-01-04 19:55:37,571 [tomcat-http--51] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 16887 ms to retrieve ObjectID=[534946145] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:37,571 [tomcat-http--49] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 15004 ms to retrieve ObjectID=[534119827] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:38,257 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 2
2013-01-04 19:55:42,265 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 3
2013-01-04 19:55:42,626 [tomcat-http--15] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 15004 ms to retrieve ObjectID=[533284430] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:46,276 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 4
2013-01-04 19:55:50,283 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 5
2013-01-04 19:55:52,574 [tomcat-http--51] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 31890 ms to retrieve ObjectID=[534946145] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:52,574 [tomcat-http--49] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 30007 ms to retrieve ObjectID=[534119827] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:54,320 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 6
2013-01-04 19:55:57,629 [tomcat-http--15] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 30007 ms to retrieve ObjectID=[533284430] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:55:58,329 [L1_L2:TCComm Main Selector Thread_R (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 might be in Long GC. Ping-probe cycles completed since last reply : 7
2013-01-04 19:56:06,899 [ClientLockManager LockGC] INFO com.tc.object.locks.ClientLockManager - ClientID[0]: Lock GC collected 87 garbage locks
2013-01-04 19:56:07,579 [tomcat-http--51] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 46895 ms to retrieve ObjectID=[534946145] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:56:07,579 [tomcat-http--49] WARN com.tc.object.RemoteObjectManager - ClientID[0]: Still waiting for 45012 ms to retrieve ObjectID=[534119827] depth : 500 parent : ObjectID=[-1]
2013-01-04 19:56:08,363 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - Socket Connect to gibux581-top:9510(callbackport:9510) taking long time. probably not reachable.
2013-01-04 19:56:08,363 [HealthChecker] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl. DSO Client - gibux581-top:9510 is DEAD
2013-01-04 19:56:08,364 [HealthChecker] ERROR com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Declared connection dead ConnectionID(0.1bcbccc868b542b0b3ce5a8dc689cb2a) idle time 42217ms
2013-01-04 19:56:08,407 [L1_L2:TCComm Main Selector Thread_W (listen 0:0:0:0:0:0:0:0:55557)] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(0.1bcbccc868b542b0b3ce5a8dc689cb2a): CLOSE EVENT : com.tc.net.core.TCConnectionJDK14@1991261055: connected: false, closed: true local=10.1.51.241:34421 remote=10.1.98.111:9510 connect=[Sat Dec 22 06:22:21 GMT 2012] idle=182ms [25762642671 read, 21375347337 write]. STATUS : DISCONNECTED

you are using very big heap size. For Terracotta servers, we normally recommend to use ParallelGC and not to use Concurrent MarkSweep. For cleints, you can use any GC collector.
If you want to avoid disconnection of clients, you can increase health check setting. By default, TSA throws clients out after 45 sec in case log GC.

Hope this will help
Cheers!!