Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: ssubbiah
A small correction to the post above.

The l1.cachemanager.percentageToEvict property controls the percentage to evict from the L1 to the L2, while l2.cachemanager.percentageToEvict controls the percentage to evict from the L2 to disk.
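
For reference, in tc.properties those look something like this (the values here are just placeholders, not tuning recommendations):

l1.cachemanager.percentageToEvict = 10
l2.cachemanager.percentageToEvict = 10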
As I said earlier, 5 seconds is the default election time, which can be changed using the property l2.ha.electionmanager.electionTimePeriod. It was set to 5 seconds to accommodate really slow machines in our test environment.
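
The setting takes its value in milliseconds, so the 5-second default would look like this in tc.properties:

l2.ha.electionmanager.electionTimePeriod = 5000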

2007-05-10 19:03:50,238 INFO - Moved to State[ PASSIVE-UNINITIALIZED ]  


There are two PASSIVE states that a server can be in: PASSIVE-UNINITIALIZED and PASSIVE-STANDBY. In the PASSIVE-UNINITIALIZED state, the passive server is still initializing its state from the active. Once it has initialized its state from the active, it moves to the PASSIVE-STANDBY state. Until it does so, it can't participate in elections and become ACTIVE.

So what you are seeing is normal. Please wait until the passive reaches PASSIVE-STANDBY before killing the ACTIVE. If you bring up both the ACTIVE and the PASSIVE simultaneously, this will be instantaneous.

PS: I see that you have enabled permanent-store, which is not really needed for networked A/P. When you restart the servers, please make sure you clean up the data store, as the servers will initialize their state from the other node.
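
From memory, the relevant block in tc-config.xml looks roughly like the one below, with temporary-swap-only (the default) being enough for networked A/P; please check the config reference for your build before relying on the exact element names.

<dso>
  <persistence>
    <mode>temporary-swap-only</mode>
  </persistence>
</dso>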

PPS: There are more changes, error checks and fixes coming in the next few weeks for this feature, so try the latest nightly builds whenever possible.
It looks like the PASSIVE tc server was immediately notified of the signal (the ACTIVE tc left the cluster), but it requires 5 seconds to become ACTIVE, as in the following log messages:
 


This is the election time, which is configurable in tc.properties via l2.ha.electionmanager.electionTimePeriod (in ms). The PASSIVE servers run an election to determine the next ACTIVE and wait for electionTimePeriod ms to collect all the votes and decide the winner. Depending on your network latency and load average you might be able to reduce this value, especially if you have only one PASSIVE.

But the tc server looks like it hangs if I use a client before the PASSIVE became ACTIVE. In this case, I have to restart the killed tc server.
I can only use the client after the killed tc server is running again.
 


I am not sure why this is the case. Do you see anything getting printed in the server logs when this happens?

Ideally, once the PASSIVE becomes ACTIVE, the clients should be able to connect and proceed seamlessly.

Can you share the logs with us when this happens? Is it consistently reproducible? You can send the logs (and the app, if possible) to ssubbiah at terracottatech.com.

thanks,
Saravanan

PS: BTW, I assume that you are working with the nightly builds from trunk?
Please note that Networked Active/Passive is still in beta, and if you really want to try it out, I would suggest the trunk nightly builds instead of 2.3.

To try networked Active/Passive, you have to do the following (see the sketch after this list):
1) Set the property l2.ha.network.enabled to true in tc.properties.
2) Make sure the data directory for each server is unique (i.e. not shared).
3) Make sure the ports are different if you are running both servers on the same machine. Note that each server uses two consecutive ports for L2-L2 and L2-L1 communication, so the dso-ports have to be at least 2 apart.
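
A minimal sketch of what that could look like for two servers on the same machine (the host names, paths and port numbers are just examples, and the element names are from memory, so check the config reference for your build):

In tc.properties on both servers:

l2.ha.network.enabled = true

In tc-config.xml:

<servers>
  <server host="localhost" name="server1">
    <dso-port>9510</dso-port>
    <jmx-port>9520</jmx-port>
    <data>/opt/terracotta/server1-data</data>
  </server>
  <server host="localhost" name="server2">
    <dso-port>9513</dso-port>
    <jmx-port>9523</jmx-port>
    <data>/opt/terracotta/server2-data</data>
  </server>
</servers>

Here the dso-ports (9510 and 9513) are more than 2 apart and each server has its own data directory.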

Some of these will change over the next few weeks in trunk and there will actually be config options for these.

If you still can't get it to work, send us the L2 logs.

Saravanan
It's a bug in our instrumentation of LinkedHashMap. Juris is in the process of fixing it now. The status can be tracked here:

https://jira.terracotta.org/jira/browse/CDV-236
We have pushed a fix for the server crash onto 2.2.1 and trunk. If you are interested, you could check out the latest source and build a kit for yourself. The instructions are at http://www.terracotta.org/confluence/display/orgsite/Building+Terracotta

The 2.2.1 release that is planned for next week will contain this fix.

Cheers
Saravanan
We are in the process of fixing the server crash problem due to misbehaving clients. This fix will be in the next release candidate that we post for 2.2.1.

From what you say, it seems to me that the problem is likely 6), as explained in my previous post.

Is there any chance that you will be able to share your application with us? You could contact one of our field engineers [ siyer at terracottatech dot com ] and they will be able to help out with this issue.
There seem to be two issues here.

1) The client throwing an OOME

2) The server crashing when the above happens during commit of a transaction. This is being tracked in our JIRA. http://jira.terracotta.org/jira/browse/CDV-111

So, for your case, if you fix 1) then 2) shouldn't happen. There are several things to consider for fixing the OOME. Without knowing your specific application I can only guess here, but all of the following should help to some extent.

1) Increasing the memory given to the clients.
2) Changing the shared data structures so that they are more hierarchical rather than flat.
3) Using HashMaps or Hashtables to hold data in large collections instead of, say, Lists. Terracotta can now handle these map structures better than some other structures and dynamically swap parts of the collection in and out of the client VM's heap.
4) Partitioning the data so that any client accesses only a portion of the data at any one time (see the sketch after this list).
5) Watching how the memory grows to see whether it increases gradually or suddenly (visualgc will help here for heap usage) and checking for a memory leak in the application.
6) Checking the transaction boundaries to see if the transactions are huge, i.e. very many changes to shared data within one lock scope. Generally this is fine, though it imposes bigger constraints on the memory requirements of your clients.
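
To make 2) through 4) a bit more concrete, here is a rough Java sketch; the class, field and method names are made up for illustration and are not from your app or from Terracotta itself.

import java.util.HashMap;
import java.util.Map;

// Hypothetical example: instead of one huge flat List shared as a root,
// partition the data into smaller Maps keyed by a partition id. A client
// then only touches the partitions it cares about, and the map structure
// lets Terracotta swap unused parts out of the client VM's heap.
public class SharedCatalog {

    // Simple value type standing in for whatever your application stores.
    public static class Record {
        private final String payload;
        public Record(String payload) { this.payload = payload; }
        public String getPayload() { return payload; }
    }

    private final Map<String, Map<String, Record>> partitions =
            new HashMap<String, Map<String, Record>>();

    public void put(String partitionId, String key, Record value) {
        // Keep the lock scope, and hence the transaction, small (see point 6).
        synchronized (partitions) {
            Map<String, Record> part = partitions.get(partitionId);
            if (part == null) {
                part = new HashMap<String, Record>();
                partitions.put(partitionId, part);
            }
            part.put(key, value);
        }
    }

    public Record get(String partitionId, String key) {
        synchronized (partitions) {
            Map<String, Record> part = partitions.get(partitionId);
            return part == null ? null : part.get(key);
        }
    }
}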

One other thing that you can try is enabling logging in the L1 and looking at the output in the logs when this OOME happens.

For this you have to create a properties file called "tc.properties" and place it in the same directory as tc.jar (generally under the $TC_INSTALL/common/lib directory). The properties file should contain the following lines:

l1.cachemanager.logging.enabled = true
l1.transactionmanager.logging.enabled = true

There are a few other properties in the L1 cachemanager that affect the way the virtual heap feature works. Send us the details of the output and we will be able to better understand your app.
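
The ones I have in mind are roughly the following (names and values from memory, so please check the tc.properties shipped with your kit before relying on them):

l1.cachemanager.enabled = true
l1.cachemanager.percentageToEvict = 10
l1.cachemanager.sleepInterval = 3000
l1.cachemanager.threshold = 70
l1.cachemanager.criticalThreshold = 90
l1.cachemanager.leastCount = 2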

Hopefully this helps,

regards,
Saravanan
Hi,

1) Is this consistently reproducible?

2) Are you running 2.2.1? Just wanted to make sure.

3) Do you see any exceptions in one of the clients? Please check the client logs to see if you see an OOME or some other type of exception in one of the clients. It seems to me like a misbehaving client, caused by some exception (OOME maybe?).

4) Also, can you try to reproduce it without killing any clients?

Your answers will help us narrow down the problem and give you a patch to try.

thanks,
Saravanan
One thing to note is that if an object becomes shared and then becomes garbage, it still resides in the L2 (tc server) until a distributed GC is run. So in your test, if you are creating new shared objects and removing old ones, all the shared objects that were created are still live until the next GC cycle. I think the default GC interval is once every 60 minutes. You can decrease this to better suit your test.
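
If I remember right, the interval is set in the servers section of tc-config.xml, something along these lines, where 3600 seconds corresponds to the 60-minute default (element names from memory, so check the config reference for your version):

<servers>
  <server host="localhost" name="server1">
    <dso>
      <garbage-collection>
        <enabled>true</enabled>
        <verbose>false</verbose>
        <interval>3600</interval>
      </garbage-collection>
    </dso>
  </server>
</servers>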

Also, since it is easy to overlook which objects actually become shared, you may want to verify the shared object graph through the admin console to make sure that only the objects you want are reachable from the root.
 