Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: kiwu
Hi,

I am trying to cluster my quartz jobs using the TIM for Quartz (tim-quartz-1.6.1-1.0.1.jar) together with TC 2.7.2.

I am using Spring but I am not using the TC module for Spring because I don't think I need it in my case.

I have a very simple class that creates a job at runtime and schedules it to run every few seconds. The scheduler itself is injected by Spring using the SchedulerFactoryBean.

I have two Tomcats running as TC nodes; the RAMJobStore is shared among the nodes and is visible in the AdminConsole. When I store jobs via one node they get executed by that same node. When I kill (-9) this node, the other node takes over and executes the jobs. So far, so good.

My problem is that I cannot shutdown Tomcat gracefully when jobs are being executed and TC is used. I have to kill -9 to get rid of running jobs so that the other node takes over. What configuration detail am I missing?

Cheers
Chris



Tomcat log when shutting down:

Code:
2009-01-06 14:22:50,911 [quartzScheduler_Worker-1] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:22:50 GMT 2009
 2009-01-06 14:22:53,816 [quartzScheduler_Worker-2] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:22:53 GMT 2009
 2009-01-06 14:22:56,815 [quartzScheduler_Worker-3] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:22:56 GMT 2009
 2009-01-06 14:22:59,815 [quartzScheduler_Worker-4] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:22:59 GMT 2009
 2009-01-06 14:23:02,816 [quartzScheduler_Worker-5] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:23:02 GMT 2009
 2009-01-06 14:23:05,816 [quartzScheduler_Worker-6] DEBUG org.quartz.core.JobRunShell - Calling execute on job DEFAULT.job:http://somewhere
 Execute job at Tue Jan 06 14:23:05 GMT 2009
 06-Jan-2009 14:23:06 org.apache.coyote.http11.Http11Protocol pause
 INFO: Pausing Coyote HTTP/1.1 on http-8080
 06-Jan-2009 14:23:06 org.apache.coyote.http11.Http11Protocol pause
 INFO: Pausing Coyote HTTP/1.1 on http-8443
 06-Jan-2009 14:23:07 org.apache.catalina.core.StandardService stop
 INFO: Stopping service Catalina
 2009-01-06 14:23:07,644 [Thread-21] INFO  org.springframework.web.context.support.XmlWebApplicationContext - Closing org.springframework.web.context.support.XmlWebApplicationContext@18c924b: display name [Root WebApplicationContext]; startup date [Tue Jan 06 14:22:44 GMT 2009]; root of context hierarchy
 2009-01-06 14:23:07,646 [Thread-21] INFO  org.quartz.core.QuartzScheduler - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
 2009-01-06 14:23:07,646 [Thread-21] INFO  org.springframework.beans.factory.support.DefaultListableBeanFactory - Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@6ea37a: defining beans [org.springframework.aop.config.internalAutoProxyCreator,identificationAspect,messagesResource,
 registrationResource,identityInterceptor,customerIdentityResolver,customerFactory,
 failedToIdentifyAnalyser,failedToReauthenticateAnalyser,anonymousCookieStrategy,
 identityWithheldCookieStrategy,identificationRequestProxy,identificationWorker,customerIdentityManager,
 customerIdentityDaoStub,sdpConcurrenyControl,customerIdentityDaoProd,sdpGetConsumerCredentialAnalyser,
 sdpCreateSessionAnalyser,identificationMessageTimeoutAnalyser,identificationExecutor,cookieIdentityEncoder,
 longDurationCookieIdentityEncoder,cacheManager,customerCache,org.springframework.beans.factory.config.
 PropertyPlaceholderConfigurer#0,propertiesFiles,additionalPropertiesFiles,quartzScheduler,
 resourceUpdateScheduler,
 customerDao,customerDaoDelegate,customerDaoStub]; root of factory hierarchy
 2009-01-06 14:23:07,647 [Thread-21] INFO  org.springframework.scheduling.quartz.SchedulerFactoryBean - Shutting down Quartz Scheduler
 2009-01-06 14:23:07,647 [Thread-21] INFO  org.quartz.core.QuartzScheduler - Scheduler quartzScheduler_$_NON_CLUSTERED shutting down.
 2009-01-06 14:23:07,647 [Thread-21] INFO  org.quartz.core.QuartzScheduler - Scheduler quartzScheduler_$_NON_CLUSTERED paused.
 2009-01-06 14:23:07,817 [quartzScheduler_Worker-3] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,817 [quartzScheduler_Worker-5] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,818 [quartzScheduler_Worker-2] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,818 [quartzScheduler_Worker-6] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,817 [quartzScheduler_Worker-4] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,913 [quartzScheduler_Worker-1] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,949 [quartzScheduler_Worker-7] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,949 [quartzScheduler_Worker-8] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,949 [quartzScheduler_Worker-10] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down
 2009-01-06 14:23:07,949 [quartzScheduler_Worker-9] DEBUG org.quartz.simpl.SimpleThreadPool - WorkerThread is shutting down



The Spring beans:

Code:
	<bean id="quartzScheduler" class="org.springframework.scheduling.quartz.SchedulerFactoryBean"
 		destroy-method="destroy">
 		<property name="waitForJobsToCompleteOnShutdown"><value>false</value></property>
 		<!-- <property name="configLocation"><value>classpath:quartz.properties</value></property> -->
 	</bean>
 	
 	<bean id="resourceUpdateScheduler" class="tmnotifications.service.ResourceUpdateScheduler">
 		<property name="scheduler" ref="quartzScheduler"/>
 	</bean>



The job doesn't do much at the moment besides printing the current date:

Code:
import java.util.Date;
 
 import org.quartz.Job;
 import org.quartz.JobExecutionContext;
 import org.quartz.JobExecutionException;
 
 public class DumbJob implements Job {
 
 	@Override
 	public void execute(JobExecutionContext arg0) throws JobExecutionException {
 		
 		Date now = new Date();
 		System.out.println("Execute job at " + now);
 
 	}
 
 }


And here is the class that schedules the job with its trigger:

Code:
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.quartz.JobDetail;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;
import org.quartz.TriggerUtils;

public class ResourceUpdateScheduler {
 
 	private static final Log logger = LogFactory.getLog(ResourceUpdateScheduler.class);
 	
 	Scheduler scheduler;
 	
 	public void scheduleResourceUpdate(ResourceObserver resourceObserver){
 		
 		JobDetail jobDetail = new JobDetail("job:"+resourceObserver.getUrl(),
 				Scheduler.DEFAULT_GROUP,
                 DumbJob.class);
 
 		try {
 			Trigger trigger = TriggerUtils.makeSecondlyTrigger(resourceObserver.getPollingFrequency(), 20);
 			scheduler.scheduleJob(jobDetail, trigger);
 		} catch (SchedulerException e) {
 			logger.error("scheduling failed",e);
 		}
 	}
 
 	public void setScheduler(Scheduler scheduler) {
 		this.scheduler = scheduler;
 	}
 	
 }
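
For completeness, this is roughly how the scheduler gets called (a minimal sketch; the ResourceObserver setters shown here are assumed for illustration, not necessarily the real method names):

Code:
	ResourceObserver observer = new ResourceObserver();
	observer.setUrl("http://somewhere");     // assumed setter
	observer.setPollingFrequency(3);         // assumed setter, poll every 3 seconds

	resourceUpdateScheduler.scheduleResourceUpdate(observer);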


The tc-config.xml:

Code:
<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
 	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
 	xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-4.xsd">
 
 	<servers>
 		<!-- 
 			First Terracotta server
 		-->
 		<server host="10.40.205.156" name="tc-server1">
 			<dso-port>9510</dso-port>
 			<jmx-port>9520</jmx-port>
 			<l2-group-port>9030</l2-group-port>
 			<data>/opt/terracotta/server-data/localhost</data>
 			<logs>/opt/terracotta/logs/localhost.log</logs>
 			<statistics>/opt/terracotta/statistics</statistics>
 			<dso>
 				<persistence>
 					<mode>permanent-store</mode>
 				</persistence>
 			</dso>
 		</server>
 		<!-- 
 			Second Terracotta server
 -->
 		
 		<server host="10.40.205.45" name="tc-server2">
 			<dso-port>9510</dso-port>
 			<jmx-port>9520</jmx-port>
 			<l2-group-port>9030</l2-group-port>
 			<data>/opt/terracotta/server-data/localhost</data>
 			<logs>/opt/terracotta/logs/localhost.log</logs>
 			<statistics>/opt/terracotta/statistics</statistics>
 			<dso>
 				<persistence>
 					<mode>permanent-store</mode>
 				</persistence>
 			</dso>
 		</server>
 		<ha>
 			<mode>networked-active-passive</mode>
 			<networked-active-passive>
 				<election-time>5</election-time>
 			</networked-active-passive>
 		</ha>
 
 		<update-check>
 			<enabled>true</enabled>
 		</update-check>
 
 	</servers>
 
 	<clients>
 		<statistics>
 			%(user.home)/terracotta/client-stats/%(tc.node-name)/
 		</statistics>
 		<logs>
 			%(user.home)/terracotta/client-logs/%(tc.node-name)/
 		</logs>
 		<dso>
 			<debugging>
 				<instrumentation-logging>
 					<class>true</class>
 					<hierarchy>true</hierarchy>
 					<locks>true</locks>
 					<transient-root>true</transient-root>
 					<roots>true</roots>
 					<distributed-methods>true</distributed-methods>
 				</instrumentation-logging>
 				<runtime-logging>
 					<lock-debug>true</lock-debug>
 					<field-change-debug>true</field-change-debug>
 					<wait-notify-debug>true</wait-notify-debug>
 					<distributed-method-debug>
 						true
 					</distributed-method-debug>
 					<new-object-debug>true</new-object-debug>
 				</runtime-logging>
 			</debugging>
 		</dso>
 		<modules>
       			 <module name="modules-common" version="2.7.2"/>
        			 <module name="spring-config" version="2.7.2"/>
 			<module group-id="org.terracotta.modules" name="tim-quartz-1.6.1" version="1.0.1"/>
 		</modules>
 	</clients>
 
 	<application>     
 	     <dso> 
 	       <instrumented-classes> 
 		 <include>
 		   <class-expression>tmnotifications.domain..*</class-expression>
 		 </include>
 	       </instrumented-classes>
 	     </dso>
 	   </application>
 
 </tc:tc-config>

I have described my motivation in the following postings:

http://forums.terracotta.org/forums/posts/list/1363.page
http://forums.terracotta.org/forums/posts/list/1371.page

The election time is used in my application in a listener to calculate the appropriate delay for the timer that disables my clustered ehcache. I managed to get the config as XML via JMX (L1 Info bean) and then I parse the XML to get the value. That works fine.
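Roughly, reading the value looks like this (a minimal sketch, not my exact code; the MBean ObjectName and the attribute name below are placeholders, the actual names have to be looked up in the JMX console first):

Code:
import java.io.StringReader;
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class ElectionTimeReader {

	// Placeholder names: the real ObjectName of the L1 Info bean and the
	// attribute holding the config XML must be taken from the JMX console.
	private static final String L1_INFO_BEAN = "org.terracotta:type=Terracotta L1 Info";
	private static final String CONFIG_ATTRIBUTE = "Config";

	public int readElectionTimeSeconds() throws Exception {
		MBeanServer server = ManagementFactory.getPlatformMBeanServer();
		String configXml = (String) server.getAttribute(
				new ObjectName(L1_INFO_BEAN), CONFIG_ATTRIBUTE);

		// Pull the election-time value out of the tc-config XML, ignoring namespaces.
		XPath xpath = XPathFactory.newInstance().newXPath();
		String value = xpath.evaluate("//*[local-name()='election-time']/text()",
				new InputSource(new StringReader(configXml)));
		return Integer.parseInt(value.trim());
	}
}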
What I am trying to do is bypass the EHCache if the clients are permanently disconnected from the cluster. Therefore I have implemented a listener that starts a timer when the client gets the disconnect event over JMX and disables the cache if the timer fires before the reconnect event is delivered. I am not shutting down the application, I am just bypassing the cache. And that might be the problem.
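In outline the listener does something like the sketch below (a minimal sketch rather than the real implementation; the notification type checks and the cache-toggling methods are placeholders for my own code, not exact Terracotta or jmx-util names):

Code:
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import javax.management.Notification;
import javax.management.NotificationListener;

public class CacheBypassListener implements NotificationListener {

	// Delay derived from the election time read via JMX, plus some headroom.
	private final long disableDelaySeconds;
	private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
	private volatile ScheduledFuture<?> pendingDisable;

	public CacheBypassListener(long disableDelaySeconds) {
		this.disableDelaySeconds = disableDelaySeconds;
	}

	public void handleNotification(Notification notification, Object handback) {
		String type = notification.getType();
		if (type.endsWith("disconnected")) {        // placeholder for the disconnect event type
			pendingDisable = timer.schedule(new Runnable() {
				public void run() {
					bypassCache();                  // no reconnect in time: bypass the clustered ehcache
				}
			}, disableDelaySeconds, TimeUnit.SECONDS);
		} else if (type.endsWith("reconnected")) {  // placeholder for the reconnect event type
			if (pendingDisable != null) {
				pendingDisable.cancel(false);       // reconnected in time, keep using the cache
			}
			enableCache();
		}
	}

	private void bypassCache() { /* application-specific */ }

	private void enableCache() { /* application-specific */ }
}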

If I understand you correctly, I have to restart the application without the TC libraries to stop them from misbehaving on the network.
Is there by any chance a way to achieve that without an application restart? Am I correct in thinking that the Enterprise version's quarantine feature doesn't require a restart?

Christian
I would like to know if it is possible to manually stop the client from trying to reconnect and opening hundreds of connections to the server. I tried different operations in the JMX console but couldn't get the client to stop.

The reason I want to do this:
I have implemented a listener that disables the clustered ehcache to allow the application to work without Terracotta in case it is permanently disconnected. The app works fine without TC, but in the background the TC client opens these hundreds of connections.

Any ideas?

Cheers

Christian
Hi

In my web application I would like to read the value of the election-time property of the tc-config.xml that is used. I understand that I might use JMX for that purpose. What is the easiest/best practice way of doing this?

Cheers

Christian
Thanks for your help.

I think I have isolated the problem.

In our application we are clustering the EHCache, and to be able to disable the cache when the clients are disconnected from the Terracotta cluster we have implemented a JMX listener (extending the DefaultListener in jmx-util-1.2.0-SNAPSHOT).

This listener listens for node disconnect events and disables the ehcache so that the application can run without Terracotta. Apparently this disconnect event fires as soon as the client is disconnected from the active server. That means the ehcache is disabled too early and the reconnect event doesn't re-enable the cache properly. What I would like is for the listener to disable the ehcache only if the client is still disconnected from the cluster after the active server election time (5 s) has passed, so that the passive server has time to take over and the clients can reconnect.

Scenario 1:
server1 active
server2 passive
server1 dies
server2 becomes active within 5 seconds
clients wait for the new server to become active
listener doesn't disable ehcache

Scenario 2:
server1 active
server2 passive
server1 dies
server2 dies
5 seconds pass by
clients get cluster disconnect event
listener disables ehcache
application runs without cache and Terracotta

Is there a JMX event I could use to achieve what I have described?
Do you have any other idea to solve my problem?

Any suggestions are very welcome.

Regards

Christian
The network connectivity looks correct. There is no firewall between servers and clients; as I said before, the clients can successfully connect to both servers. It is just the reconnect after the failover that doesn't work. According to the TCP dump the clients successfully establish TCP connections to the new active server, but they aren't using them. Instead they keep opening more, and after a while we have hundreds of open TCP connections.
We tried it both ways: with server1 active and with server2 active, and with and without restarting the stopped server. We always got the same result. The first time the clients connect to the cluster it works (with both servers, so there is no connectivity issue). But when failover happens and the active server changes, the clients do not reconnect to the other server. They get disconnected before the reconnect window closes.


Sorry .. the server1 log file was not the right one. I have attached the actual one.

The client log file is OK; it just shows a different timezone, so you have to add 1 hour. Sorry for the confusion.

The crucial part is ..

server1:
Code:
2008-09-05 13:05:48,559 [CommonShutDownHook] INFO com.terracottatech.dso - L2 Exiting...


server2:
Code:
2008-09-05 13:05:48,975 [TCComm Main Selector Thread (listen 0.0.0.0:9030)] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: TCGroupManager - Connection to [drukrf36:9030] DISCONNECTED. Health Monitoring for this node is now disabled.
 2008-09-05 13:05:48,976 [TCComm Main Selector Thread (listen 0.0.0.0:9030)] INFO com.tc.net.protocol.tcm.ClientMessageChannel - ClientMessageChannel moves to SessionID=[1]
 2008-09-05 13:05:48,976 [TCComm Main Selector Thread (listen 0.0.0.0:9030)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - Ignore switching Node{host=rf36:9030}:Member-In-Group from Connecting to Init
 2008-09-05 13:05:48,976 [TCComm Main Selector Thread (listen 0.0.0.0:9030)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - Ignore switching Node{host=rf36:9030}:Member-In-Group from Connected to Init
 2008-09-05 13:05:48,977 [WorkerThread(group_events_dispatch_stage,0)] WARN com.tc.l2.ha.L2HACoordinator - NodeID[rf36:9030] left the cluster
 2008-09-05 13:05:48,977 [WorkerThread(group_events_dispatch_stage,0)] WARN com.terracottatech.console - NodeID[rf36:9030] left the cluster
 2008-09-05 13:05:48,977 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.StateManagerImpl - Starting Election to determine cluser wide ACTIVE L2
 2008-09-05 13:05:48,978 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.ElectionManagerImpl - Election Started : Enrollment [ NodeID[rf37:9030], isNew = false, weights = -2204084578683330112,798024125280138313 ]
 2008-09-05 13:05:53,980 [WorkerThread(group_discovery_stage,1)] WARN com.tc.net.groups.TCGroupMemberDiscoveryStatic - Node:Node{host=rf36:9030} java.net.ConnectException: Connection refused
 2008-09-05 13:05:53,981 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.ElectionManagerImpl - Election Complete : [Enrollment [ NodeID[rf37:9030], isNew = false, weights = -2204084578683330112,798024125280138313 ]] : State[ Election-Complete ]
 2008-09-05 13:05:53,982 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.StateManagerImpl - Becoming State[ ACTIVE-COORDINATOR ]
 2008-09-05 13:05:53,982 [WorkerThread(group_events_dispatch_stage,0)] INFO com.terracottatech.console - Becoming State[ ACTIVE-COORDINATOR ]
 2008-09-05 13:05:53,982 [WorkerThread(group_events_dispatch_stage,0)] INFO com.tc.l2.state.ElectionManagerImpl - Declared as Winner: Winner is : Enrollment [ NodeID[rf37:9030], isNew = false, weights = -2204084578683330112,798024125280138313 ]
 2008-09-05 13:05:53,992 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.l2.ha.ClusterState - Setting the Next Available OID to 181000
 2008-09-05 13:05:53,994 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.l2.ha.ClusterState - Setting the Next Available Global Transaction ID to 20001
 2008-09-05 13:05:53,997 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.objectserver.tx.ServerTransactionManager - Waiting for txns to complete
 2008-09-05 13:05:53,997 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.objectserver.tx.ServerTransactionManager - No more txns in the system.
 2008-09-05 13:05:53,998 [WorkerThread(l2_state_change_stage,0)] INFO com.terracottatech.console - Starting reconnect window: 120000 ms.
 2008-09-05 13:05:53,998 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.net.protocol.transport.ServerStackProvider - Preparing comms stack for previously connected client: ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64)
 2008-09-05 13:05:54,001 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.net.protocol.transport.ServerStackProvider - Preparing comms stack for previously connected client: ConnectionID(0.878bcaf5e6934b279b6ddca247d8dc64)
 2008-09-05 13:05:54,001 [WorkerThread(l2_state_change_stage,0)] INFO com.tc.net.protocol.transport.ServerStackProvider - Preparing comms stack for previously connected client: ConnectionID(1.878bcaf5e6934b279b6ddca247d8dc64)
 2008-09-05 13:05:54,001 [WorkerThread(l2_state_change_stage,0)] INFO com.terracottatech.console - Terracotta Server has started up as ACTIVE node on 0.0.0.0:9510 successfully, and is now ready for work.


client:
Code:
2008-09-05 14:05:48,974 [TCComm Main Selector Thread] INFO com.tc.net.protocol.transport.ConnectionHealthCheckerImpl: DSO Client - Connection to [drukrf36:9510] DISCONNECTED. Health Monitoring for this node is now disabled.
 2008-09-05 14:05:48,975 [TCComm Main Selector Thread] INFO com.tc.net.protocol.tcm.ClientMessageChannel - ClientMessageChannel moves to SessionID=[1]
 2008-09-05 14:05:48,976 [TCComm Main Selector Thread] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: LOCK_REQUEST_MESSAGE (9) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:05:48,977 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Reconnect attempt 0 of unlimited reconnect tries to rf36:9510, timeout=10000
 2008-09-05 14:05:48,985 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:49,987 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Reconnect attempt 0 of unlimited reconnect tries to rf37:9510, timeout=10000
 2008-09-05 14:05:49,989 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:50,990 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:51,993 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:52,357 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:05:52,995 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:53,997 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:54,999 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:05:56,001 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@712800881: connected: true, closed: false local=192.168.12.184:44987 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:05:56 BST 2008] idle=0ms [0 read, 0 write]
 2008-09-05 14:05:57,273 [DSO Lock Object.wait() timer] INFO com.tc.object.lockmanager.api.ClientLockManager - ChannelID=[2]: Running lock GC took 0 ms 1iterations for GCing 0 locks. gcCandidates remaining = 0 total locks remaining = 1
 2008-09-05 14:05:57,360 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:02,362 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:06,106 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Timeout of 10000 milliseconds occured
 2008-09-05 14:06:07,109 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:06:07,364 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:08,111 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@59144625: connected: true, closed: false local=192.168.12.184:44996 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:06:08 BST 2008] idle=0ms [0 read, 0 write]
 2008-09-05 14:06:12,365 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:17,367 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:18,216 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Timeout of 10000 milliseconds occured
 2008-09-05 14:06:19,220 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:06:20,224 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@1993667015: connected: true, closed: false local=192.168.12.184:45001 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:06:20 BST 2008] idle=0ms [0 read, 0 write]
 2008-09-05 14:06:22,370 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:27,372 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:30,329 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Timeout of 10000 milliseconds occured
 2008-09-05 14:06:31,332 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:06:32,335 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@1077311193: connected: true, closed: false local=192.168.12.184:45010 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:06:32 BST 2008] idle=0ms [0 read, 0 write]
 2008-09-05 14:06:32,375 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:37,377 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:42,379 [RemoteTransactionManager Flusher] INFO com.tc.net.protocol.tcm.ClientMessageChannel - Drop old message: COMPLETED_TRANSACTION_LOWWATERMARK_MESSAGE (45) Expected SessionID=[1] but got SessionID=[-1]
 2008-09-05 14:06:42,440 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Timeout of 10000 milliseconds occured
 2008-09-05 14:06:43,443 [ConnectionEstablisher] WARN com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Connection refused
 2008-09-05 14:06:44,445 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@378259625: connected: true, closed: false local=192.168.12.184:45016 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:06:44 BST 2008] idle=0ms [0 read, 0 write]


As you can see the failover of the servers works fine. When server1 exits, server2 takes over and becomes active.
When you look in the client log you see that the client gets disconnected from server1 and tries to reconnect many times.
Apparently at this line
Code:
2008-09-05 14:05:56,001 [ConnectionEstablisher] INFO com.tc.net.protocol.transport.ClientMessageTransport - ConnectionID(2.878bcaf5e6934b279b6ddca247d8dc64): Attaching new connection: com.tc.net.core.TCConnectionJDK14@712800881: connected: true, closed: false local=192.168.12.184:44987 remote=192.168.12.158:9510 connect=[Fri Sep 05 14:05:56 BST 2008] idle=0ms [0 read, 0 write]

the client connects to server2.
Why is the client not using this connection? It is established 7 seconds after the disconnect, so well within the 2 minute reconnect window.
Why is it opening further connections?

When you look at the TCP dump it shows the same situation: the client opens many connections without using them. Very similar to the problem described in the other thread (see link above).

BTW: you are right, there were 3 clients involved.
Here are both server logs
I am having a very similar problem. See this thread http://forums.terracotta.org/forums/posts/list/1371.page

And I am using the same version 2.6.2.
Hi

I am testing failover with 2 Terracotta servers (2.6.2) and 2 clients (Tomcat 6), all on separate machines. When I kill the active server the passive one takes over and becomes active. That works fine. But the Tomcat clients do not reconnect to the new active server. They keep trying to connect, and every few seconds a TCP connection to the new active server seems to be established (judging from the TCP traffic), but for reasons I don't understand the clients keep trying to connect and hence open hundreds of connections. I think it is similar to the problem described in this thread http://forums.terracotta.org/forums/posts/list/1363.page

The clients don't realise that they already have a connection. No data is sent over these established connections.

Any ideas why this is happening?

Cheers

Chris
It would be helpful to see the configuration file. Could you post the tc-config.xml file under C:\terracotta-2.6.2\?

Chris
No problem.

https://jira.terracotta.org/jira/browse/CDV-806

Cheers
Chris
I figured out how to use a dedicated node name and avoid the problems with the ClusterBeanRegistrationListener mentioned before. You just have to avoid using tc.node-name as the JVM property name and instead use a name like "node" to run two TC clients on the same host machine.

In my catalina.sh script I have this line

export CATALINA_OPTS=$CATALINA_OPTS" -Xdebug -Dnode=node1 -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=n" 


And in my tc-config.xml I have the following lines that use this property:

..
<clients>
<statistics>%(user.home)/terracotta/client-stats/%(node)/</statistics>
<logs>%(user.home)/terracotta/client-logs/%(node)/</logs>
.. 



Apparently only the JVM property tc.node-name changes the object name of the ClusterBean.

Perhaps it would be good to change the warning message that appears when you run several TC clients on the same machine without setting a JVM property for distinction.

**************************************************************************************
The statistics buffer couldn't be opened at
'/home/cmacher/terracotta/client-stats'.
The CVT system will not be active for this node.

A common reason for this is that you're launching several Terracotta L1
clients on the same machine. The default directory for the statistics buffer
uses the IP address of the machine that it runs on as the identifier.
When several clients are being executed on the same machine, a typical solution
to properly separate these directories is by using a JVM property at startup
that is unique for each client.

For example:
dso-java.sh -Dtc.node-name=node1 your.main.Class

You can then adapt the tc-config.xml file so that this JVM property is picked
up when the statistics directory is configured by using %(tc.node-name) in the
statistics path.
************************************************************************************** 
 