Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: ssubbiah
Cool! We understand. I will work on getting you something to run at your end so you can send us more info.

thanks,
Saravanan
Please back up the jdb files. They're crucial for figuring out this problem.
From what you say, it seems like it is happening to you consistently. It will be easier for us to debug if we have the jdb files, so that we can write some tools to analyze the data.

Is that something that you can share with us? If so, please ping me at ssubbiah at terracottatech dot com and I can give you an FTP account to upload the files.

If not, I can give you a patched version of tc.jar that you can run against your data files and send us the logs.

It looks like something that shouldn't be null is null in the persisted version of your data, and every time the Distributed Garbage Collector hits it, it crashes.


cheers
This is being tracked here.

http://jira.terracotta.org/jira/browse/CDV-761
Again, this log is from the passive server (192.168.100.55).

Can you attach the log from the active server (192.168.100.50)?

From the passive server's log, I see that there may have been some transient network problem between the active and the passive for about a second or so.


2008-05-14 08:40:14,031 [WorkerThread(group_events_dispatch_stage,0)] WARN com.tc.l2.ha.L2HACoordinator - NodeID[192.168.100.50:9530] left the cluster
....
2008-05-14 08:40:15,274 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.ha.L2HACoordinator - NodeID[192.168.100.50:9530] joined the cluster
 


This caused the active to request the passive to quit. If you want to protect against such transient network failures, there are some configuration parameters you can tune; our field engineers will be able to help you with that.
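Purely as an illustration of the kind of knobs involved (the property names and values vary between releases, so treat these as assumptions and confirm them against the tc.properties shipped with your version before relying on them), the idea is to widen the window the servers tolerate before declaring a node dead:

Code:
 # ILLUSTRATIVE names/values only - verify against your release's tc.properties
 l2.nha.tcgroupcomm.reconnect.enabled = true
 l2.nha.tcgroupcomm.reconnect.timeout = 5000
 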

I still don't see the active server quitting.

cheers,
I only see one server going down. Did the active server go down too? If so, please post both logs.

This exception is normal when you start a passive server with a persistent database. The active server is asking the passive server to quit because there is data in the persistent data store. If you clean up the store and then restart the passive server, this won't happen.

In future TC versions, this will be automatic.

cheers,
Even in "temporary-swap" mode we write to disk when we are not able to fit all the objects in memory at the L2. So if you want to reduce the disk activity, give the L2 more memory.

The problem with giving more memory is that GCs then take longer. I don't know if that is acceptable for you or not. One thing that we found to work well in the past is to instead give the OS a lot of memory for its disk cache: your L2 GCs faster while you are still not hitting the disk for reads. Writes will go to disk eventually, though.

Another trick is to run DGC (our Distributed Garbage Collection) more often so you avoid swapping garbage out to disk. You collect the garbage while it is still in memory, thus relieving the memory pressure. This will help only if you are creating a lot of garbage, which seems to be true for your use case.
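If it helps, the DGC interval is configured in tc-config.xml. A rough sketch from memory of the 2.x schema (the server name and the 600-second interval are just examples; check it against your own config):

Code:
 <servers>
   <server name="server1">
     <dso>
       <garbage-collection>
         <enabled>true</enabled>
         <verbose>false</verbose>
         <!-- seconds between DGC runs; lower means more frequent collection -->
         <interval>600</interval>
       </garbage-collection>
     </dso>
   </server>
 </servers>
 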

Also there are some cachemanager properties that you can tweak to keep more objects in memory; I have sketched the main ones at the end of this post. As a first step, I would suggest that you enable the cachemanager logging in tc.properties to see how many objects are getting swapped out.

Code:
 l2.cachemanager.logging.enabled = true
 


But like Ari said, if disk IO is not your bottleneck, reducing it will show no improvement in your throughput.
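Following up on the cachemanager properties I mentioned above, these are the main knobs, roughly as I remember them. Treat the names and values as a sketch rather than gospel and verify them against the tc.properties in your kit:

Code:
 # ILLUSTRATIVE values - tune against your own heap size and workload
 l2.cachemanager.enabled = true
 # start evicting when heap usage crosses this percentage
 l2.cachemanager.threshold = 70
 # evict more aggressively past this percentage
 l2.cachemanager.criticalThreshold = 90
 # fraction of cached objects to evict per pass
 l2.cachemanager.percentageToEvict = 10
 # milliseconds between cachemanager passes
 l2.cachemanager.sleepInterval = 3000
 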

Hi Ben,

We are looking into this. Meanwhile, I was wondering if you could share with us a reproducible test case that showcases this problem, if possible. That would greatly help us debug and fix this issue.

Also, is this problem happening to you with Terracotta 2.5.4 too? If it is only happening with the 2.6-stable load, can you try recreating this with the following property set in tc.properties?

Code:
 l2.lockmanager.greedy.lease.enabled = false
 


cheers
We have pushed a fix for this issue in 2.6. Can you please try out the latest nightly builds to see if this problem is fixed for you?
Yeah, fixing the network glitch should help. If you are writing a script, I would suggest backing up the data files instead of deleting them, just to be on the safe side.
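A minimal sketch of what I mean (the paths are placeholders; use whatever your tc-config's <data> element points at, and make sure the passive is stopped first):

Code:
 #!/bin/sh
 # move the passive's object db aside instead of deleting it (paths are illustrative)
 DATA_DIR=/opt/terracotta/server-data
 BACKUP_DIR=/opt/terracotta/backups/objectdb-$(date +%Y%m%d-%H%M%S)
 mkdir -p "$BACKUP_DIR"
 mv "$DATA_DIR/objectdb" "$BACKUP_DIR/"
 # now restart the passive server; it will resync from the active
 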

Sadly, it would be much nicer if the active node would just tell the passive node to clean up, but obviously the software wasn't written that way. :-)
 



We deliberately didn't automate this because you could lose data in certain extreme failure scenarios if we did. We wanted the operator to make this decision cautiously and didn't want the db files to be accidentally erased. Also, it is very easy for anyone to write a script around it to erase the files and start a server if that's what they want.

We are constantly improving our software and are building more resilient features into the product. Our new 2.6 software, which is in beta, has our new comms stack, which is more tunable to protect against such intermittent failures.

cheers,
Looks like there is some intermittent network disruption. Node Terracotta01 left the cluster and rejoined it immediately. We have seen this happen a couple of times when there was some kind of network glitch.


2008-03-31 11:23:08,279 [WorkerThread(group_events_dispatch_stage,0)] WARN com.tc.l2.ha.L2HACoordinator - NodeID[tcp://Terracotta01:9530] left the cluster
2008-03-31 11:23:08,279 [WorkerThread(group_events_dispatch_stage,0)] WARN com.terracottatech.console - NodeID[tcp://Terracotta01:9530] left the cluster
2008-03-31 11:23:08,279 [WorkerThread(channel_life_cycle_stage,0)] INFO com.tc.objectserver.handler.ChannelLifeCycleHandler - Received transport disconnect. Shutting down client NodeID[tcp://Terracotta01:9530]
2008-03-31 11:23:08,280 [WorkerThread(channel_life_cycle_stage,0)] INFO com.tc.objectserver.persistence.impl.TransactionStoreImpl - shutdownClient() : Removing txns from DB : 0
2008-03-31 11:23:08,280 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.ha.L2HACoordinator - NodeID[tcp://terracotta01:9530] joined the cluster
2008-03-31 11:23:08,280 [WorkerThread(group_events_dispatch_stage,0)] INFO com.terracottatech.console - NodeID[tcp://terracotta01:9530] joined the cluster
 



Does this happen consistently? Can you verify whether you have any network errors, maybe a faulty card?

From the logging, it looks like you are running either 2.6 or trunk, since this warning is not present in the 2.5.2 branch. Can you please verify?


If you are running 2.6 or trunk, it is still in beta; we recently fixed some bugs, so you shouldn't be seeing this warning anymore. Please try the latest nightly build and let us know.

If it happens again, can you please send us the logs?
Even though this is also a Sleepycat deadlock exception, it is not related to https://jira.terracotta.org/jira/browse/CDV-502. If you look closely, there are two threads (the GC thread and the flush stage thread) involved in that deadlock.

The strange thing about the exception you posted is that, going by the printout, there seems to be no other thread holding the lock. Sleepycat is kind of dumb about deadlock detection in that it doesn't really detect deadlocks: if a thread has been waiting for a lock for a long time, longer than the lock timeout, it throws this exception.

Now there could be a deadlock, but I can't see it in this log. One possibility is that if your server is running too slowly because of swapping and long GCs, a lock acquire could time out with this exception.

So, did you monitor the L2 while this happened? How long did GC take? Was the machine swapping? jstat and vmstat should give you a good idea.
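Something along these lines is usually enough (replace <pid> with the L2's process id):

Code:
 # GC behaviour of the L2 JVM, sampled every 5 seconds
 jstat -gcutil <pid> 5000
 # memory and swap activity on the box, sampled every 5 seconds
 vmstat 5
 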

Also, is this consistently reproducible? Can you please try the latest trunk nightly to see if you can get this to happen? In trunk, we print the entire thread dump on these error conditions, which will be useful in debugging.

thanks,
Saravanan
The fault graph that you are seeing in the admin console is the faulting of objects to the L1 from the L2. There is a cache hit rate graph that corresponds to the number of objects faulted from disk to the L2. Does that spike too?

Also, you will see transactions in the server only when you modify the shared objects in the graph. So when the txn rate flatlines, what are the clients doing?
 