[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: ssubbiah
Profile for ssubbiah -> Messages posted by ssubbiah [115] Go to Page: Previous  1, 2, 3, 4, 5, 6, 7, 8 Next 
Author Message
Looks like there is some problem with the connectivity between those two machines in your network (IP conflict or faulty router?).

Please verify the connectivity between those two machines, and if the problem still exists, try running both servers on the same box (on different ports) and see if you can reproduce the problem.

cheers,
Saravanan
A fix for the assertion in the passive server was pushed into trunk in rev 5840. Please use tomorrow's nightly build to verify this.

thanks,
Saravanan
A fix for this issue was pushed into trunk in rev 5838. You can test it with tomorrow's nightly build.

Saravanan
Currently there is no way to configure this behavior. When the TC server is down, it is treated as a critical failure, and the clients wait for the server to recover before proceeding.

It is certainly possible to make this behavior configurable, with a timeout value after which an exception is thrown that users could catch and handle however they want.

Please open a feature request JIRA and we will look into it.

thanks,
Saravanan
Currently you cannot specify the weight, but there was a discussion about supporting that in the future. I would suggest that you open a JIRA if you want that feature.

We have our own election algorithm where every node votes for itself with a weight; the nodes then come to a consensus and pick the winner with the maximum weight. Normally under this condition, it's the connected clients (L1s) that determine which server wins.

For example, if there are 2 clients connected to server s1 and you pull the cable to s1, then both clients connect to server s2. Now when s1 comes back, s1 and s2 negotiate and s1 backs off (even though it might have been active the longest) because the cluster has moved forward with s2.

But if there are either no clients connected, or there is a network error where one client is still connected to s1 and one client is connected to s2, then there is no way we can determine which one should back off. In these scenarios the Terracotta servers notify you that there is a possibility of split-brain, and the operator is expected to decide which server backs off.
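As an illustrative sketch (hypothetical code, not Terracotta internals), the weighted election described above boils down to something like this:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a weighted election: every node votes for itself
// with a weight, the votes are exchanged, and the node with the maximum
// weight wins. All names here are illustrative, not Terracotta internals.
public class WeightedElection {

    static final class Vote {
        final String nodeId;
        final long weight; // e.g. connected-client count, uptime, etc.

        Vote(String nodeId, long weight) {
            this.nodeId = nodeId;
            this.weight = weight;
        }
    }

    // Every node runs this over the full set of votes and independently
    // arrives at the same answer: the vote with the maximum weight wins.
    static String electWinner(List<Vote> votes) {
        Vote winner = null;
        for (Vote v : votes) {
            if (winner == null || v.weight > winner.weight) {
                winner = v;
            }
        }
        if (winner == null) {
            throw new IllegalStateException("no votes cast");
        }
        return winner.nodeId;
    }

    public static void main(String[] args) {
        // s2 carried the cluster forward with 2 clients, so s1 backs off.
        List<Vote> votes = Arrays.asList(new Vote("s1", 0), new Vote("s2", 2));
        System.out.println(electWinner(votes)); // prints "s2"
    }
}
```

In the cable-pull example above, s2's two connected clients give it the higher weight, so s1 backs off when it rejoins.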
The WARNING message from server 1 and the GroupManager messages from server 2 suggest that this may be a network setup issue.

The server hns3g102 seems to bind to the localhost IP instead of the external or NATed IP. Please check the machine's network configuration to make sure that the hostname resolves to an external IP address and not to the localhost IP (127.0.0.1).

The other option is to give the actual IP addresses instead of the hostnames in tc-config.xml.


BTW, we have a new release, 2.4.3, that came out yesterday. It seems you are running an old build.

thanks,
Saravanan
This is a problem with Enums which got fixed recently. Please try the latest nightly builds from http://www.terracotta.org/confluence/display/orgsite/Download

The next patch release, 2.4.3, which will be out sometime next week, will contain the fix as well.

thanks,
Saravanan
Hi Michael,

Looks like the Terracotta server is running out of memory (OOME).

What I do is syncing every second on block 10.000 key-value pairs <String, String>  


Are you saying that you are adding a lot of Strings into a shared Map?
Strings are treated as literals in Terracotta, and currently a Map containing a lot of literals cannot be partially stored in memory. What this means is that collections containing literals must be able to fit in memory. In any case, it is not a good idea to have one collection containing millions of literals.

If you are adding millions of Strings into a flat collection (Maps, Lists, etc.), you might end up with an OOME. It is better to partition your data into a deeper graph, a Map of Maps for example, not only for Terracotta but also for faster lookups and writes.
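For example, a simple hand-rolled partitioning scheme (plain Java sketch, all names are made up for illustration; this is not a Terracotta API) could look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: instead of one flat map with millions of String
// entries, spread the keys over a fixed number of buckets so each inner
// map stays small.
public class PartitionedMap {
    private final int partitions;
    private final List<Map<String, String>> buckets;

    public PartitionedMap(int partitions) {
        this.partitions = partitions;
        this.buckets = new ArrayList<Map<String, String>>(partitions);
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ConcurrentHashMap<String, String>());
        }
    }

    // Mask off the sign bit so the index is non-negative even when the
    // key's hash code is negative.
    private Map<String, String> bucketFor(String key) {
        return buckets.get((key.hashCode() & 0x7fffffff) % partitions);
    }

    public void put(String key, String value) {
        bucketFor(key).put(key, value);
    }

    public String get(String key) {
        return bucketFor(key).get(key);
    }
}
```

Each bucket is an independent object graph, so the server can manage them separately instead of having to hold one giant collection at once.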

Hope that helps,
Saravanan
Hi Jamie,

Yes, synchronized(this) with autolocks should work if you don't want to synchronize every setter, and it will actually be more efficient when changing multiple fields at once.
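As a sketch of that pattern (the class and field names here are made up for illustration), one synchronized(this) block covering several field writes means a single lock acquisition instead of one per setter:

```java
// Illustrative class: with a Terracotta autolock on deposit(), the
// synchronized(this) block below becomes one clustered write lock
// covering both field updates.
public class Account {
    private long balance;
    private long lastUpdated;

    public void deposit(long amount) {
        synchronized (this) {
            balance += amount;                         // first field update
            lastUpdated = System.currentTimeMillis();  // second field update
        }
    }

    public synchronized long getBalance() {
        return balance;
    }
}
```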

Coming back to the problem: are you running your L2 server in persistent mode? If so, are you cleaning up the data directory when you restart the server? There might be some stale data from old runs that is causing this issue.

Also, please send us the logs from the client and the server; they might give us some clue.

thanks,
Saravanan
Hi Jamie,

Is this consistently reproducible? If so, is it possible to share your app with us? If not, can you write a reproducible test case and share that with us? You can contact me at ssubbiah at terracottatech dot com.

On a side note, I saw that you used a lot of named locks in your config. Without looking at your code I can't say this for sure, but it looks to me like you may want to use autolocks instead.

Named locks protect against concurrent access to sections of code under the same named lock. For example, if you want only one thread to execute either put() or putIfAbsent() in your case, then you probably want to either give both locks the same name or use autolocks.

Named locks are only useful when you don't have access to the source code but still want to use Terracotta to cluster. Autolocks are generally better.
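For reference, the difference looks roughly like this in the locks section of tc-config.xml (treat this as a sketch: the method expressions and the lock name are placeholders for your own classes, and you should check the docs for the exact schema):

```xml
<locks>
  <!-- Autolock: promotes the synchronized blocks already present in
       matching methods to clustered locks. -->
  <autolock>
    <method-expression>* com.example.Cache.put(..)</method-expression>
    <lock-level>write</lock-level>
  </autolock>

  <!-- Named lock: every thread entering a matching method contends on
       the one lock named "cache-lock", whatever object is involved. -->
  <named-lock>
    <lock-name>cache-lock</lock-name>
    <method-expression>* com.example.Cache.putIfAbsent(..)</method-expression>
    <lock-level>write</lock-level>
  </named-lock>
</locks>
```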

Check out http://www.terracotta.org/confluence/display/docs1/Concept+and+Architecture+Guide#ConceptandArchitectureGuide-Locks for more info on locks.

thanks,
Saravanan
Hi Bikram,

I think you may have accidentally uploaded the same terracotta-server.log twice.

Can you please add the other log? Also, please try the latest nightly build from trunk; it can be found at http://www.terracotta.org/confluence/display/orgsite/Download

Saravanan

But still facing the problem with Network based Active-Passive. I mapped my "logfiles" folder on the network and started with 2 tc servers on 2 different machines. 


I am a little confused. When you say you "mapped your logfiles on the network", do you mean that you are using the same data directory over the network via NFS or Samba? For network-based Active-Passive you don't have to do that. The data is synced over the network by the TC server, so you don't need a shared disk (over the network or otherwise).

Disk-based Active-Passive:
1) Set persistence to permanent-store in tc-config
2) Have a common shared disk for data (NFS, SMB, SAN or local disk)
3) Make sure that the tc.property l2.ha.network.enabled is NOT set to true.

Network-based Active-Passive:
1) Set persistence to temporary-swap-only in tc-config
2) Have a different data directory for each server
3) Make sure that the tc.property l2.ha.network.enabled is set to true.
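As a concrete sketch, the network-based setup above might look roughly like this in tc-config.xml (hostnames and paths are placeholders, and you should double-check the element names against the schema shipped with your release):

```xml
<!-- tc-config.xml fragment: network-based Active-Passive -->
<servers>
  <server host="host1" name="server1">
    <!-- each server gets its own data directory -->
    <data>/opt/terracotta/server1-data</data>
    <dso>
      <persistence>
        <mode>temporary-swap-only</mode>
      </persistence>
    </dso>
  </server>
  <server host="host2" name="server2">
    <data>/opt/terracotta/server2-data</data>
    <dso>
      <persistence>
        <mode>temporary-swap-only</mode>
      </persistence>
    </dso>
  </server>
</servers>
```

In addition, set l2.ha.network.enabled = true in tc.properties on both servers.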

I also recommend that you use the latest nightly build from trunk for testing network-based Active-Passive.

Hopefully that helps.

Saravanan

PS: You can catch us in our IRC channel if you need further help.
I see from the logs that you have enabled network-based Active-Passive. If you want to use the same data directory, you probably want to try disk-based Active-Passive, in which case disable network-based Active-Passive (it is in tc.properties and turned off by default).

Also, clean up the data directory, kill all clients, and start fresh.
 
Powered by JForum 2.1.7 © JForum Team