Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)

Hi! I'm using 2 Terracotta server instances for HA with permanent-store turned on.

Some time ago I needed to stop the whole cluster. First all clients were killed, then the standby server was shut down, then the active server. During this process I got an error on the standby server instance:

Code:

 ********************************** ERROR ***********************************
 * This server is running with persistence turned on and was stopped in
 * PASSIVE-STANDBY state. Only the ACTIVE-COORDINATOR server is allowed  to
 * be restarted without cleaning up the data directory with persistence
 * turned on.
 * 
 * Please clean up the data directory and make sure that the
 * ACTIVE-COORDINATOR is up and running before starting this server. It is
 * important that the ACTIVE-COORDINATOR is up and running before starting
 * this server else you might end up losing data
 ****************************************************************************
 
 java.lang.Throwable
         at com.tc.l2.ha.ClusterState.validateStartupState(ClusterState.java:69)
         at com.tc.l2.ha.ClusterState.<init>(ClusterState.java:49)
         at com.tc.l2.ha.L2HACoordinator.init(L2HACoordinator.java:99)
         at com.tc.l2.ha.L2HACoordinator.<init>(L2HACoordinator.java:92)
         at com.tc.objectserver.impl.DistributedObjectServer.start(DistributedObjectServer.java:954)
         at com.tc.server.TCServerImpl.startDSOServer(TCServerImpl.java:458)
         at com.tc.server.TCServerImpl.access$600(TCServerImpl.java:82)
         at com.tc.server.TCServerImpl$StartAction.execute(TCServerImpl.java:412)
         at com.tc.lang.StartupHelper.startUp(StartupHelper.java:39)
         at com.tc.server.TCServerImpl.startServer(TCServerImpl.java:443)
         at com.tc.server.TCServerImpl.start(TCServerImpl.java:218)
         at com.tc.server.TCServerMain.main(TCServerMain.java:28)
 2010-03-22 02:27:53,844 [main] ERROR com.terracottatech.dso - Marking the object db as dirty ...
 2010-03-22 02:27:53,848 [main] ERROR com.terracottatech.console - This standby Terracotta server instance had to restart to automatically wipe its database and rejoin the cluster.

Maybe the question sounds silly, but nevertheless I want to ask:
is the stored data of the standby server supposed to be cleared manually, or is it removed automatically? What happened to the data of my standby server instance? Was it removed automatically and then synchronized with the active server, or was it not removed?

And the second strange thing: in the client logs I found records like this:
Code:

 2010-03-22 10:59:09,966 [TP-Processor110] WARN com.tc.object.RemoteObjectManager - ClientID[1217]: Still waiting for 30000 ms to retrieve ObjectID=[23295] depth : 500 parent : ObjectID=[-1]

There are lots of these records, with times varying from 15000 up to 60000+ ms. I think this causes 'lags' in the app. Could you kindly explain what exactly these records mean?
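
To get an overview of how many objects are affected and how long the waits get, the warnings can be pulled out of the client log and grouped by ObjectID. This is only a minimal sketch: the pattern is built from the single record shown above (with each record on one line), other Terracotta versions may format it differently, and the function names are hypothetical helpers, not part of Terracotta.

```python
import re
from collections import defaultdict

# Pattern derived from the one sample record in this thread (assumption:
# every "Still waiting" warning follows the same layout).
WAIT_RE = re.compile(
    r"ClientID\[(?P<client>\d+)\]:\s*Still waiting for (?P<ms>\d+) ms "
    r"to retrieve ObjectID=\[(?P<oid>-?\d+)\]\s*"
    r"depth : (?P<depth>\d+) parent : ObjectID=\[(?P<parent>-?\d+)\]"
)

def parse_wait_warnings(lines):
    """Yield (object_id, waited_ms, depth, parent_id) for each matching line."""
    for line in lines:
        m = WAIT_RE.search(line)
        if m:
            yield int(m["oid"]), int(m["ms"]), int(m["depth"]), int(m["parent"])

def longest_wait_per_object(lines):
    """Map each ObjectID to the longest wait (in ms) reported for it."""
    longest = defaultdict(int)
    for oid, ms, _depth, _parent in parse_wait_warnings(lines):
        longest[oid] = max(longest[oid], ms)
    return dict(longest)
```

Feeding it the lines of a client log (for example `longest_wait_per_object(open(path))`, where `path` is wherever your client log lives) shows whether a few ObjectIDs account for most of the waiting.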

Terracotta 3.0.1, as of 20090514-130552 (Revision 12704 by cruise@su10mo5 from 3.0)

Thanks a lot!

With regard to the restart "message": yes, it is just warning you of the behavior.

When a passive goes down and is then restarted, it cannot know which "Terracotta transactions" it has missed. Therefore, whatever data it has on disk is moved aside and the Terracotta server comes up with a "blank" state. So it is possible that when you restarted the services, you restarted the passive server first.

If a server was the Active Coordinator when it went down and you restart that one first, you will not encounter this message.
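
The rule described above can be written down as a small decision function. This is an illustrative model only, not Terracotta's code: the state strings are taken from the error banner in this thread, while `needs_clean_data_dir`, `restart_order` and their parameters are hypothetical names.

```python
ACTIVE = "ACTIVE-COORDINATOR"
PASSIVE = "PASSIVE-STANDBY"

def needs_clean_data_dir(persistence_on, state_at_shutdown):
    """Return True if the on-disk data cannot be reused at restart.

    Models the rule from the error banner: with persistence on, a server
    that was stopped in PASSIVE-STANDBY state must start from a clean
    data directory.
    """
    return persistence_on and state_at_shutdown == PASSIVE

def restart_order(states_at_shutdown):
    """Order server names so that the former active is started first.

    `states_at_shutdown` maps a server name to its state when it stopped.
    """
    return sorted(states_at_shutdown,
                  key=lambda name: states_at_shutdown[name] != ACTIVE)
```

For a two-server cluster, `restart_order({"server2": PASSIVE, "server1": ACTIVE})` puts `server1` first, which is the order that avoids the message.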

Thanks for the reply!

Could you please explain what 'depth' and 'parent' mean in the warning message below? Does it mean that Terracotta tries to load a shared object that contains 500 nested objects, or that Terracotta has to traverse 500 objects to reach the object we need?

Could you also explain what can cause such a long load time?

2010-03-22 10:59:09,966 [TP-Processor110] WARN com.tc.object.RemoteObjectManager - ClientID[1217]:
Still waiting for 30000 ms to retrieve ObjectID=[23295] depth : 500 parent : ObjectID=[-1]

As for the Active and Passive (Standby) server states:
the product documentation has a note about how standby server instances synchronize with the active when the active fails:

A standby cannot become an active server instance during a failure until its state is fully synced up

But what is the standby server's behavior when the active server is up and the standby is restarted with a manually cleared db? Is it correct that the standby server starts synchronizing immediately (in a background thread) while also handling all current Terracotta transactions? What message or other kind of notification is there when all the previous data has been synchronized? Or is a full sync done only when the standby server becomes active?

Thanks in advance!

what does 'depth' and 'parent' mean

Depth is the depth of the object graph. I believe you see 500 because that is the default 'fault-count': if you miss a reference and it needs to be looked up on the server, the current DSO implementation will "fault in" that missing reference plus 'fault-count' references in its vicinity. Parent, I reckon, is the object that references the object in question. We would need a lot more detail to say specifically what was going on that resulted in a 30 s wait to retrieve a particular ObjectID. If possible, I'd suggest moving to the latest TC version (3.2.1).
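
The "fault in the missing reference plus 'fault-count' neighbours" idea can be sketched as a bounded graph traversal. This is a toy model under stated assumptions, not the DSO implementation: the object graph is a plain dict, "vicinity" is interpreted as breadth-first reachability, and `fault_in` is a hypothetical name.

```python
from collections import deque

def fault_in(graph, missing_id, fault_count=500):
    """Return the missing object ID plus up to `fault_count` reachable IDs.

    `graph` maps an object ID to the list of IDs it references.
    Breadth-first order is an assumption made for this sketch.
    """
    batch = [missing_id]
    seen = {missing_id}
    queue = deque([missing_id])
    while queue and len(batch) - 1 < fault_count:
        current = queue.popleft()
        for ref in graph.get(current, ()):
            if ref in seen:
                continue
            if len(batch) - 1 >= fault_count:
                break  # batch is full; stop collecting neighbours
            seen.add(ref)
            batch.append(ref)
            queue.append(ref)
    return batch
```

With `graph = {1: [2, 3], 2: [4], 3: [5]}`, `fault_in(graph, 1, 2)` returns `[1, 2, 3]`: the requested object and two of its neighbours, leaving 4 and 5 for a later fault.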

Yes, the standby will come up and resync its state from the active. Once it finishes synchronizing, it moves into the PASSIVE-STANDBY state and is then part of the cluster as the passive.

Thank you very much! I'll analyze the code in depth to handle the situation.