[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Using 2 HA Terracotta servers
lvuong

neo

Joined: 06/13/2008 01:44:45
Messages: 8
Offline

Hi,

I am focusing on an HA architecture with 2 Terracotta servers.

When the LAN between the 2 servers is OK, there is only 1 active Terracotta server at any given moment.

But imagine the LAN between the 2 servers is down: each server will become active (is that true?), so the content of each server may diverge from the other.

Now the LAN is re-established between the 2 servers: what happens?
Are the 2 servers able to merge their contents? Or is there a manual procedure for correctly restarting the 2 HA servers?

Regards
Laurent
gbevin

praetor

Joined: 07/04/2007 09:09:42
Messages: 210
Offline

Hi,

I suppose you're talking about a networked active-passive setup. If that's the case, you can find the answer to your question here:
http://terracotta.org/confluence/display/docs1/Creating+a+Terracotta+Server+Cluster#CreatingaTerracottaServerCluster-Disadvantagesofrunninginnetworkmode

Hope this helps,

Geert

lvuong

neo

Joined: 06/13/2008 01:44:45
Messages: 8
Offline

Thank you,

The topic is interesting, but it does not explain the active/active mode.

Perhaps we can recover from this situation with the admin console?

Laurent
gbevin

praetor

Joined: 07/04/2007 09:09:42
Messages: 210
Offline

Hi Laurent,

Active/active mode is not a feature that is available yet, but it will most probably be sometime in the future. Therefore we have no documentation about it yet.

Best regards,

Geert

ari

seraphim

Joined: 05/24/2006 14:23:21
Messages: 1665
Location: San Francisco, CA
Offline

gbevin is correct. There is no active/active in this case. What might be confusing you is a log statement in the passive TC server stating that it became active. That wouldn't matter, because no connected JVMs are talking to it.

Your assumption that 2 TC servers both become active is false. They both attempt to become active, but only one has all the connected JVMs bound to it. This one will remain active. The 2nd will think it is active, but it will have no clients. When it comes back online (LAN link re-established), the real active will zap/kill it and re-image it to the correct state automatically.



--Ari
tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

Just to be clear, the situation you are talking about is often referred to as split brain (not active/active).

As Ari mentions, Terracotta has a number of features built in to prevent split brain, but in certain scenarios it is impossible for the product to tell whether the other side has actually failed or is merely unreachable.

Here's an example:

You have 4 Terracotta clients. 2 Terracotta servers. 2 clients and one server live on network partition A, and the other 2 clients and one server live on network partition B.

If you sever the link between partition A and B, then each of the two partitions will be exactly one half of the original cluster.

Here is what will happen with Terracotta:

In network partition A, the severed link will appear as a loss of the two clients that live in partition B, and the loss of the passive server, which also resides in partition B. None of these events are fatal from the active cluster's perspective, so this partition will carry on without any trouble.

In network partition B, the clients will see the split as a loss of their active server. They will attempt to fail over to the passive server. The passive server will see the loss of the active server and will elect itself as master. It will receive the failover connections from the clients that are in its partition. It will then wait for a period of time until *all* clients that were in the original cluster before the failover event (in this case, all 4 clients) connect to it. It *will not* proceed until either all clients from the original cluster have connected to it or that waiting period has expired.

Now, the server in partition B can do a couple of things in this waiting state. By default, it will wait for 2 minutes for all clients to connect, before moving on. This value is configurable, so if you want it to move on before 2 minutes, you should set the client-reconnect-window setting lower. If you want it to wait longer (let's say 10 minutes, or 24 hours) then you should set it higher.
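The waiting rule described above can be sketched as a toy model. This is not Terracotta code: the function name, its parameters, and the returned state strings are hypothetical, and the only rule it encodes is "proceed once every client from the original cluster has reconnected, or once the reconnect window has expired". The 120-second default mirrors the 2 minutes mentioned above.

```python
def passive_decision(original_clients, reconnected_clients,
                     elapsed_s, reconnect_window_s=120):
    """Toy model (not Terracotta code) of a newly elected server's wait state.

    original_clients    -- clients known to the cluster before the failover
    reconnected_clients -- clients that have reconnected to this server so far
    elapsed_s           -- seconds since this server elected itself
    reconnect_window_s  -- hypothetical stand-in for client-reconnect-window
    """
    missing = set(original_clients) - set(reconnected_clients)
    if not missing:
        # Everyone from the original cluster is back: safe to move on.
        return "proceed_all_clients_reconnected"
    if elapsed_s >= reconnect_window_s:
        # Window expired: the server moves on without the missing clients.
        return "proceed_window_expired"
    return "waiting"


# Partition B from the example: 4 original clients, only 2 can reach server B.
original = {"c1", "c2", "c3", "c4"}
in_partition_b = {"c3", "c4"}
print(passive_decision(original, in_partition_b, elapsed_s=30))
print(passive_decision(original, in_partition_b, elapsed_s=120))
```

In this model, a larger window simply keeps server B in the "waiting" state for longer, which is the trade-off discussed next.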

So you have a choice about how you want Terracotta to respond in this situation. However, if you let partition B proceed after some period of time (for example, to cover the case where partition A is not merely cut off but has actually lost power), then you can definitely end up in a split-brain situation. This is unavoidable given the scenario, and there is a huge amount of research into how to prevent it.

The easiest thing you can do to completely prevent split brain is to use the STONITH (shoot the other node in the head) method that is supported in Red Hat Cluster with fencing devices. Fencing devices guarantee that one and only one partition will survive, as each partition in the cluster will try to power down the other side. Only one of these power-down (STONITH) requests will succeed, leaving one partition intact.
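To illustrate why only one side survives, here is a minimal sketch of the race. It is a deliberately simplified model, not Red Hat Cluster code: the `FencingDevice` class and `run_fencing_race` function are hypothetical, and a Python lock stands in for the hardware that honours only the first power-down request.

```python
import threading


class FencingDevice:
    """Toy stand-in for a fencing device: only the first request succeeds."""

    def __init__(self):
        self._lock = threading.Lock()
        self.survivor = None  # partition whose request won the race

    def request_power_off(self, requester, target):
        # 'target' is the partition the requester wants powered down.
        with self._lock:
            if self.survivor is not None:
                # Another partition already won; this requester is the one
                # being powered down, so its own request is refused.
                return False
            self.survivor = requester
            return True


def run_fencing_race():
    device = FencingDevice()
    results = {}

    def attempt(me, other):
        results[me] = device.request_power_off(me, other)

    threads = [
        threading.Thread(target=attempt, args=("A", "B")),
        threading.Thread(target=attempt, args=("B", "A")),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return device.survivor, results


survivor, results = run_fencing_race()
print("surviving partition:", survivor)
```

Which partition wins is not deterministic; the point is only that exactly one request succeeds, so the two sides can never both continue.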

I suggest you search for STONITH and Red Hat clustering to read more on this topic.

I hope that helps clarify the situation.

EDIT: Here is a good doc to read: http://sources.redhat.com/cluster/wiki/FAQ/Fencing

lvuong

neo

Joined: 06/13/2008 01:44:45
Messages: 8
Offline

Hi Mr. Gautier,

Thank you for the explanation. So if I set the client-reconnect-window parameter to 12 hours, for example, and the link between the 2 servers is cut for less than 12 hours, then in the previous scenario, when the link is re-established, will the Terracotta server in network partition A stay the active one?

Regards
lvuong
tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

Yes, that is correct (sorry for the delay in responding; I didn't see your question on this thread).

What will happen is that the server in partition A will see the loss of partition B and continue functioning.

In partition B, server B will elect itself master, as it will see the loss of the clients and server in partition A. The 12-hour reconnect window means that after electing itself master, server B will wait up to 12 hours for all clients to reconnect before moving forward. Since not all clients reconnect within those 12 hours, partition B will not move forward.

When the partitions reconnect inside the 12-hour window, the active server in partition A will send the server in partition B a "zap" request, meaning that the server in partition B will be told to kill itself.
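The outcome for a given outage length can be sketched as follows. This is a toy model, not Terracotta code: `partition_outcome` and its return strings are hypothetical, and the conversion to seconds assumes that client-reconnect-window is expressed in seconds, which should be checked against the configuration reference for your version.

```python
from datetime import timedelta


def partition_outcome(outage, reconnect_window):
    """Toy model of the scenario above; both arguments are timedeltas."""
    if outage < reconnect_window:
        # Link restored while server B is still waiting for clients:
        # the active server in partition A zaps server B.
        return "server_b_zapped_by_a"
    # Server B has already moved on with its own clients.
    return "split_brain_possible"


window = timedelta(hours=12)
# Value to configure, assuming the setting is in seconds.
print(int(window.total_seconds()))
print(partition_outcome(timedelta(hours=3), window))
print(partition_outcome(timedelta(hours=13), window))
```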