[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: ssubbiah  XML
Profile for ssubbiah -> Messages posted by ssubbiah [115] Go to Page: Previous  1, 2, 3, 4, 5, 6, 7, 8 Next 
Checkout the forum post http://forums.terracotta.org/forums/posts/list/594.page

It talks about similar problem.

The main problem is that the L1 in AS1 is not detecting that its connection to AS2 has been severed. We are looking at some solutions and will keep you posted.

BTW, when this happens, is your client (L1) in AS1 idle (i.e. not doing any work, such as reads/writes of shared objects)? If so, can you make it do some work and see what happens?

thanks,
Saravanan

Firstly, I know that killing Terracotta on server_node#1 manually correctly shifts operations to the Terracotta on server_node#2, and everything works as expected. From this I know that at least the config is correct. But as I've explained, pulling the ethernet cable from server_node#1 totally fails to continue serving pages from server_node#2. However, I don't understand these error messages:
 


Like I mentioned earlier, we have a "Ping Health check" running between the active and passive L2 servers, so when you pull the network cable it is detected; hence you see these messages in the console, and the passive L2 becomes active.

Currently this "Ping Health check" is not present between the L1 and the L2, so it fails to detect the network failure immediately. TCP should be able to detect such a failure by default if keepalives are enabled.

Doing some research revealed that the default TCP keepalive time on Linux is greater than 2 hours.
Check out http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/. Also, the application has to enable keepalive on the socket, which I don't think we do right now.
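For reference, enabling keepalive is a one-line socket option in Java. This is a minimal sketch (not Terracotta's actual networking code) showing that SO_KEEPALIVE is off by default and has to be switched on per socket, typically before connecting:

```java
import java.net.Socket;

public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        Socket s = new Socket();
        // SO_KEEPALIVE is off by default; the OS only sends keepalive
        // probes on an idle connection if the application enables it.
        System.out.println("default keepAlive = " + s.getKeepAlive());
        s.setKeepAlive(true);
        System.out.println("after enabling    = " + s.getKeepAlive());
        s.close();
    }
}
```

Note that even with the option enabled, the Linux defaults mean roughly 2 hours of idle time before the first probe is sent, so the kernel-level keepalive interval also has to be tuned for fast failure detection.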

Keepalives only apply when the connection is idle. So I assume that if you hit some pages and perform some operations on the Tomcat server, it will generate traffic to the L2, the socket will throw an exception, and the connection will be dropped. If your test client is not able to connect to the Tomcat server, please send us thread dumps of the Tomcat server and we will look into it. This is of course a workaround; we are looking at some other alternatives and will keep you posted.

thanks,
Saravanan
I don't see your Tomcat instance 2 detecting a network failure to the old active server.

We constantly run a health-check ping between the active and passive servers, so when you pull the plug, the failure is detected immediately and the passive becomes active. Between the L1 and the L2, however, this health-check ping is not implemented, for various reasons.

So it is possible that your Tomcat instance 2 is not detecting the network disruption, either because of long TCP timeouts or because the TCP stack has keepalive turned off. Once you start throwing load at it, it will detect the failure immediately, since it won't be able to write any data to the network. If it is not able to detect the network disruption within the 2-minute client-reconnect window, the new active server will not let that L1 connect to it. Makes sense?

To solve this for your particular use case, you could either start throwing load at the Tomcat server so it detects the failure fast, or tune the TCP stack to enable keepalive with a short timeout.

Hopefully this helps !

Did you check in the admin console? Also, please attach the entire client log as an attachment. I specifically want to see what gets printed in the client logs after the first server goes down.

thanks,
Saravanan
From the logs, it looks like neither client is able to reconnect to the passive server after it becomes active.

Code:
  2007-11-12 16:32:38,438 [Reconnect timer] INFO com.tc.objectserver.handshakemanager.ServerClientHandshakeManager - Reconnect window closing. Killing any previously connected clients that failed to connect in time: [ChannelID=[1], ChannelID=[0]] 
 


I am not sure how virtual IPs work with TC, since I have not used them. Can you verify from the admin console that the client IP addresses shown are the original IP addresses and not the virtual IP address? Also, can you please send us the Terracotta client logs for Tomcat server 2?

thanks,
Saravanan
From the logs, it looks like the server was running with default config and hence *not* running in persistent mode.

Code:
 
 <!-- This config file is used by the server when non is specified. -->
 
 <tc:tc-config xmlns:tc="http://www.terracotta.org/config">
     <system>
         <configuration-model>development</configuration-model>
     </system>
 
     <servers>
         <server host="%i">
             <data>%(user.home)/terracotta/server-data</data>
             <logs>%(user.home)/terracotta/server-logs</logs>
         </server>
     </servers>
 
   <clients>
       <logs>%(user.home)/terracotta/client-logs</logs>
   </clients>
 </tc:tc-config>
 
 
Can you please attach the server logs from this run ? From the stack trace, it does look like the server was not running in persistence mode.

thanks,
Saravanan
Hi Jamie,

Thanks for uploading the logs.

From the logs, it looks like there was some transient network disruption in your network at about 10:13 yesterday, which made both machines become active. The split brain was detected and printed in the logs. Unfortunately, this kind of network failure requires operator intervention to decide which L2 wins.

There are some basic JMX beans exposed that can give you the state of a server. If needed, one can write a program that monitors those beans and notifies the operator on errors like these.
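As a sketch of that monitoring idea: the snippet below connects over JMX and lists the Terracotta-domain beans, from which a monitoring script could read the server state. The JMX port (9520) and the `org.terracotta*` domain pattern are assumptions here; check the admin console for the actual bean and attribute names in your version.

```java
import java.io.IOException;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class L2StateProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical defaults: JMX port and domain pattern may differ
        // in your setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:9520/jmxrmi");
        ObjectName pattern = new ObjectName("org.terracotta*:*");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
            // List every bean in the Terracotta domains; the bean exposing
            // the server state (active/passive) will be among them.
            Set<ObjectName> names = mbs.queryNames(pattern, null);
            for (ObjectName name : names) {
                System.out.println(name);
            }
        } catch (IOException e) {
            // No L2 reachable on this port; nothing to monitor.
            System.out.println("Could not connect: " + e.getMessage());
        }
    }
}
```

A cron job or watchdog thread running this kind of query can page the operator when the state changes unexpectedly or when both servers report active.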

cheers,
Saravanan
Hi,

We are looking at this issue. I have created a JIRA for this one.

https://jira.terracotta.org/jira/browse/CDV-502


In the meantime, you could try running the server with persistence turned on; this problem shouldn't happen then.
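For example, turning on persistence means adding a persistence mode to each server element in the tc-config, along these lines (the `permanent-store` mode name follows the 2.x config format; verify it against your version's schema):

```xml
<servers>
    <server host="%i">
        <data>%(user.home)/terracotta/server-data</data>
        <logs>%(user.home)/terracotta/server-logs</logs>
        <dso>
            <persistence>
                <!-- default is temporary-swap-only, which does not
                     survive a server restart -->
                <mode>permanent-store</mode>
            </persistence>
        </dso>
    </server>
</servers>
```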

thanks,
Saravanan
Hi,

I don't know if it is a bug in JForum, but when I downloaded the logs, both are for hstc01 (which started as passive and then became active at around 10:13). The uploaded files have different sizes, but what I downloaded is exactly the same file both times; I guess since the names are the same, JForum gets confused.

From what I see in this log, as far as hstc01 is concerned, the other server went away (either due to a network failure, or it actually crashed).

Can you please upload the logs after renaming them to different filenames ?

thanks,
Saravanan
Actually, the network glitch seems to have caused both servers to become ACTIVE.

If you look at the passive server's log, you can find this:



2007-10-17 09:34:21,538 [WorkerThread(group_events_dispatch_stage,0)] WARN com.terracottatech.console - NodeID[tcp://192.168.17.11:9530] left the cluster
2007-10-17 09:34:21,538 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.StateManagerImpl - Starting Election to determine cluser wide ACTIVE L2
2007-10-17 09:34:21,538 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.ElectionManagerImpl - Election Started : Enrollment [ NodeID[tcp://192.168.17.12:9530], isNew = false, weights = 4832452271299545381,-607931938336320954 ]
2007-10-17 09:34:26,538 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.ElectionManagerImpl - Election Complete : [Enrollment [ NodeID[tcp://192.168.17.12:9530], isNew = false, weights = 4832452271299545381,-607931938336320954 ]] : State[ Election-Complete ]
2007-10-17 09:34:26,538 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.StateManagerImpl - Becoming State[ ACTIVE-COORDINATOR ]

 


This in turn seems to have caused the deadlock. Can you please get a thread dump from both servers and upload it? You can create a JIRA to track it if you like.

thanks,
Saravanan
This is the current response from Oracle on this issue.

=======================================

Hi Saravanan,

I have inspected the log files that you sent from customer #2, in "tcData.zip". Indeed, the checksum exception is reported when trying to use DbPrintLog (by default it performs checksum verification):

16.10.2007 10:22:31 FileReader readNextEntry
SEVERE: Halted log file reading at file 0x0 offset 0x989294 offset(decimal)=9998996:
entry=LN_TX/0(typeNum=1,version=0)
prev=0x9891b6
size=110
Next entry should be at 0x989310
:
com.sleepycat.je.log.DbChecksumException: (JE 3.2.23) Read invalid log entry type: 0
at com.sleepycat.je.log.LogEntryHeader.<init>(LogEntryHeader.java:69)
at com.sleepycat.je.log.FileReader.readBasicHeader(FileReader.java:523)
at com.sleepycat.je.log.FileReader.readNextEntry(FileReader.java:268)
at com.sleepycat.je.util.DbPrintLog.dump(DbPrintLog.java:63)
at com.sleepycat.je.util.DbPrintLog.main(DbPrintLog.java:130)


Again, there is a big block of zeros in log file 0, starting at the offset reported above, which spans up to the end of this log file. Moreover, this is on a remotely mounted disk, not a local filesystem.
We are going to look closely at the code where we open two file descriptors when we perform a group commit (this is the code where we optimize how the file is flushed to disk when there are a lot of calls to Txn.commit). This is a complicated area of the code, and it will take us time to analyze, at least this week. We cannot promise a delivery date for something that you could try just yet. Nevertheless, this issue is at the top of the list for us, and we are all looking into it at the moment.


===========================================


For the moment, I would suggest that you run your servers in networked active-passive mode and write the data files to a local drive.

thanks,
Saravanan
I have uploaded the db files to the existing Oracle service request we have about this problem. I will update you once we get an answer.

thanks,
Saravanan
Actually, that message is misleading and shouldn't be a warning anyway; it has already been changed in trunk. The server will not OOME because of this condition. Your problem is probably something totally different.

How much heap did you give the server? How many clients connect to it, and what is the transaction rate? How long does the server run before OOMEing?

Send us some more info on the environment along with the server logs and we will look into it.

thanks,
Saravanan
The fact that it works if you run both servers on the same box, together with the messages in the logs, all suggests some kind of network problem.

We use Tribes for group communication, and it uses two sockets between the nodes (one for each direction). I am wondering if one-way communication is somehow broken in your network.

Can you use ncat and/or ttcp to verify that you can connect to server sockets in both directions, and that you can send some data consistently over the socket without any exceptions?
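If it helps, here is a minimal Java sketch of the same check: a tiny client that writes a line and expects it echoed back. On real hosts you would run an echo listener (or `ncat -l`) on each node and probe in both directions; the self-contained demo below stands up its own listener locally just so it runs anywhere.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class SocketProbe {
    // Connects to host:port, writes one probe line, and returns the reply.
    // Run this from node A against node B, then from B against A, to rule
    // out one-way connectivity problems between the two machines.
    static String probe(String host, int port, String payload) throws IOException {
        try (Socket s = new Socket(host, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println(payload);
            return in.readLine();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in local echo listener so the sketch is self-contained.
        try (ServerSocket server = new ServerSocket(0)) {
            Thread echo = new Thread(() -> {
                try (Socket c = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(c.getInputStream()));
                     PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
                    out.println(in.readLine()); // echo the line back
                } catch (IOException ignored) {
                }
            });
            echo.start();
            String reply = probe("localhost", server.getLocalPort(), "ping");
            System.out.println("reply = " + reply); // expect "ping"
            echo.join();
        }
    }
}
```

If the probe succeeds in one direction but hangs or fails in the other, that matches the kind of one-way breakage that would disrupt the two Tribes sockets.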

 
Powered by JForum 2.1.7 © JForum Team