Terracotta Discussion Forums (legacy read-only archive)
Messages posted by: ssubbiah
Yes, that is true. That said, the new "rejoin" feature in Terracotta 3.5 / Ehcache 2.4.1 would let the client connect back to the server.
I think that when huge datasets have to be synced to the passive, the system gets overwhelmed and response times at the clients suffer. There are a few things you can try here.

1. You can try scaling out the Terracotta Server Array. That way each stripe manages a smaller amount of data and you might not see this behavior.

2. BigMemory helps to a certain extent by keeping all that data in memory and making everything very fast.

3. You can also throttle the sync, slowing it down and giving the system some breathing room to process client transactions:

l2.objectmanager.passive.sync.batch.size = 50 (default is 500)
l2.objectmanager.passive.sync.throttle.timeInMillis = 1 (default 0)

You could try a few different values for these two properties.

Also, the 3.5 release, which is due in a few days, has improved syncing capabilities that address this issue among other things.

Hope that helps.

With BigMemory, GC tuning became unnecessary. Without BigMemory, some use cases didn't show a significant improvement with those settings over the defaults, so they were left out.
When the passive takes over as active, it waits for all known clients to connect to it so the cluster can resume from its old state. In your test 3, since you kill one client too, the new active is basically waiting for the dead client to connect, or until its reconnect window closes, after which it will proceed normally.

I think the reconnect window is 120 seconds by default and can be configured in tc-config. During this time things look stuck, but the server is just waiting for clients to reconnect.
Can you please post the entire logs and the cluster thread dumps? Lock state is not persisted to disk, so that shouldn't be the problem.

If you don't want to post them to the forums, you could send them to serverteam at terracottatech dot com.
The full logs of the server and the client would be helpful to understand what is going on.
A few observations from the logs you sent.

1. It seems like the young GC is not helping you, as it is not really collecting any objects. So I would suggest just disabling it; it is only wasting CPU and memory for your use case.

2. The total object count in your app seems very low, on the order of 6,000. Is that what is expected? The last full DGC printed this:

Code:
 2010-06-05 14:51:27,581 [DGC-Thread] INFO com.tc.objectserver.dgc.impl.MarkAndSweepGarbageCollector - DGC[ 4224 ] complete : GarbageCollectionInfo [ Iteration = 4224 ] =  type  =  full,  startTime = 1275706279153 begin object count = 6035 end object count = 5872 markStageTime = 8425 pauseStageTime = 1 totalMarkCycleTime = 8428 candiate garabage  count = 166 actual garbage count = 166 pre rescue count = 176 rescue1Time = 1 rescue 1 Count = 166 rescue2Time = 1
 


3. The app seems to be making bigger and bigger transactions, and I think this is the root cause you need to track down.

It starts out with this:
Code:
 2010-06-05 09:11:40,913 [WorkerThread(apply_changes_stage, 0)] WARN com.tc.object.dna.impl.BaseDNAEncodingImpl - Attempt to read a byte array of len: 8005638; threshold=8000000
 2010-06-05 09:11:40,921 [WorkerThread(l2_ooo_net_send_stage, 12, 0)] WARN com.tc.net.core.TCConnection - Warning: Attempting to send a messaage of size 8006280 bytes
 


And it keeps growing until it reaches this, and then the server throws an OutOfMemoryError.
Code:
 2010-06-05 14:52:26,150 [WorkerThread(apply_changes_stage, 0)] WARN com.tc.object.dna.impl.BaseDNAEncodingImpl - Attempt to read a byte array of len: 76275566; threshold=8000000
 2010-06-05 14:52:26,279 [WorkerThread(l2_ooo_net_send_stage, 2, 0)] WARN com.tc.net.core.TCConnection - Warning: Attempting to send a messaage of size 76276208 bytes
 


So somewhere in the app, either some data structure like a map is growing indefinitely, or something is making bigger and bigger transactions. If you can identify what it is, that will help solve this problem.
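As a purely hypothetical illustration of that second case (none of these names come from your app), a shared object that keeps replacing a byte[] field with an ever larger copy would produce exactly this pattern: every write transaction has to carry the entire, ever-growing array, which matches the growing "byte array of len" warnings above.

Code:
 public class GrowingTransaction {
     // Hypothetical clustered object holding a single, ever-growing byte[] field.
     private byte[] payload = new byte[0];
 
     // Each call replaces the field with a bigger copy under the object's lock,
     // so each write transaction ships the whole array, not just the new bytes.
     synchronized void append(byte[] more) {
         byte[] bigger = new byte[payload.length + more.length];
         System.arraycopy(payload, 0, bigger, 0, payload.length);
         System.arraycopy(more, 0, bigger, payload.length, more.length);
         payload = bigger;
     }
 }
 

Breaking such data into bounded chunks (for example, adding fixed-size segments to a map under new keys) is one way to keep the amount of changed data per transaction bounded.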

A shorter session expiry should definitely help more objects become garbage, and thus let DGC walk a smaller graph during each collection. So it should lead to shorter DGC times, though for the first few runs it will probably lose whatever it gained in the mark phase to the delete phase, which has to delete a huge set of objects.

Bringing the "begin object count" to a stable number would be an important thing to do. An ever-increasing set of live objects is not a good sign.

Also, when you say that the TC servers are now running at a constant 75% CPU, is it because of the increased load you are throwing at the cluster? In other words, is the CPU usage directly related to the TPS the server is processing, or is it because of constant GC at the server or some other side effect of the higher load / bigger data set?

Depending on what is causing the slowness, you could try increasing the heap size, doing some Berkeley DB tuning, or doing some DGC tuning such as increasing the frequency.

Another way to scale out your cluster is to add another stripe and thus distribute the load. If you expect to sustain this high load for a long period of time, or want to protect against these spikes in the future, that might not be a bad idea. It will also help bring DGC times down.

Also, please send both the active and passive server logs so we can compare the stack traces that caused the server to die. Ideally it would have been good to get the object database too for some forensics, but I guess you were running in non-persistent mode, so it would be of little use.

You could also open a JIRA at jira.terracotta.org and attach the logs there.

If you don't want to have the logs in public, please email them to ssubbiah at terracottatech.com and I will forward them to the engineering team here.

thanks
ThreadIDs are created one per thread in the client VM. They are used to identify and associate lock requests/responses with threads. They are stored in a thread-local variable in ThreadIDManager and in a WeakHashMap in one other place, so they should not leak.

If your app is creating a lot of threads, and each thread is locking a ReentrantLock and never releasing it before dying, then locks will leak, as the system doesn't know when an application thread dies, and as a result ThreadIDs will also leak. If this is the case, it has to be fixed in the app (always use try/finally to acquire and release ReentrantLocks, as in the sketch below). I'm not sure if that is what is happening in your app.
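A minimal sketch of the try/finally pattern being recommended (the class and lock names here are illustrative, not from your app):

Code:
 import java.util.concurrent.locks.ReentrantLock;
 
 public class SafeLocking {
     // Hypothetical lock shared by the application threads.
     private static final ReentrantLock lock = new ReentrantLock();
 
     static void doWork() {
         lock.lock();
         try {
             // ... touch the shared state here ...
         } finally {
             // Runs even if the critical section throws, so the lock is always
             // released before the thread dies and nothing is left leaked.
             lock.unlock();
         }
     }
 }
 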

If this is not the case and you have steps to reproduce the leak, please share them with us and we will look into it.
The logs themselves didn't give much of a clue, except that they show 4 or 5 objects went missing. It is possible that these objects were garbage collected; why is still not known.

Regarding improper use of locks causing this: if you write under a lock like
Code:
 synchronized (map) {
  map.put(k1,v1);
 }
 

and read without acquiring the lock like
Code:
 v = map.get(k1);
 

Or if you use two different locks, one to read and one to write, like
Code:
 synchronized(lock1) {
   map.put(k1,v1);
 }
 

and do
Code:
 synchronized(lock2) {
   v = map.get(k1);
 }
 

then you could get stale state from the map due to the race caused by the wrong locks. That could cause this exception. Without looking at your code, I can't say that is the reason.
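For completeness, a minimal sketch of the consistent pattern implied above, where every read and write of the map goes through the same lock (the names are illustrative only):

Code:
 import java.util.HashMap;
 import java.util.Map;
 
 public class ConsistentLocking {
     // Hypothetical shared map; the map itself is used as the single monitor.
     private final Map<String, String> map = new HashMap<String, String>();
 
     void write(String k1, String v1) {
         synchronized (map) {   // same lock for writes...
             map.put(k1, v1);
         }
     }
 
     String read(String k1) {
         synchronized (map) {   // ...and for reads, so nothing stale is observed
             return map.get(k1);
         }
     }
 }
 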

One thing you can try is to disconnect both your clients, restart them, and see if the exception repeats. If the stale references were only present in the clients, then you won't see the exception again. If the stale references are also present in the server, i.e. if one of the objects there is referencing these missing objects, then you will still see the exception.

If you can consistently reproduce this with a use case that you are willing to share with us, that will help us resolve this issue quickly.
Would it be possible to attach all the logs, i.e. the client logs from before and after the restart plus the server logs? They might give us some clue.

Are you able to consistently reproduce it? If so, please add the exact steps to reproduce it. That would greatly help us.

It is possible to get an object-not-found exception if the locking is not right, i.e. if you read/write using different locks or under no locks. But generally the problem will be transient in that case.

Starting two clients or killing one client shouldn't cause this behavior.

From your logs:
Code:
 2010-01-27 05:29:48,752 [main] INFO com.terracottatech.dso - This L2 Node ID = NodeID[ec2-.compute-1.amazonaws.com:9510]
 2010-01-27 05:29:48,752 [main] INFO com.tc.l2.state.ElectionManagerImpl - Election Started : Enrollment [ NodeID[ec2-compute-1.amazonaws.com:9510], isNew = true, weights = 911706895560434739,-2357268809839870500 ]
 2010-01-27 05:29:53,753 [main] INFO com.tc.l2.state.ElectionManagerImpl - Election Complete : [Enrollment [ NodeID[ec2-.compute-1.amazonaws.com:9510], isNew = true, weights = 911706895560434739,-2357268809839870500 ]] : State[ Election-Complete ]
 2010-01-27 05:29:53,754 [main] INFO com.tc.l2.state.StateManagerImpl - Becoming State[ ACTIVE-COORDINATOR ]
 2010-01-27 05:29:53,754 [main] INFO com.terracottatech.console - Becoming State[ ACTIVE-COORDINATOR ]
 


So this server did not sync its state from the current active on startup. If you bring all running Terracotta servers down (for maintenance), you have to bring the previously active server up first and wait for it to become active again before bringing the other servers up; that way the latest state is synced to all servers.

As long as you have at least one active server up, or follow the above-mentioned steps, you will not encounter this problem. And your data should be safe with the last active server anyway.
I'm not sure from your description whether this is exactly what you are doing, but this exception can happen if you stop the active server (s1), let the passive (s2) take over, then stop s2 and start s1 with the old persistent database files.

Since the cluster has moved forward with s2, the state in s1's DB is not current. So you have to either start s2 first, so it takes over as active again, or start s1 before you kill s2, so the current state is synced between the servers first.
Actually, he is running in networked active-passive mode, so it shouldn't matter whether he is running in "persistent-store" or "temporary-swap" mode.
 