Terracotta Discussion Forums
Messages posted by: loriente
@jhouse

You can reproduce this by scheduling a larger number of jobs with closely spaced trigger fire times (e.g. 1, 2, 3, ... seconds apart) and then shutting down the JVM manually by terminating the test. There is a good chance the shutdown happens while a trigger is in the ACQUIRED state (you might need to try a couple of times, but it should be easy to hit).

Then restart the scheduler and check whether it recovers the trigger that was left in the ACQUIRED state.
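
For illustration, the closely spaced fire times described above could be generated with a small helper like the following. This is only a sketch: the class and method names are hypothetical, and the actual Quartz scheduling calls are left out because they depend on the Quartz version and setup in use. Each returned Date would be used as the start time of one trigger.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Hypothetical helper, not part of Quartz: computes fire times spaced a fixed
// interval apart so that several triggers fire in quick succession.
final class StaggeredFireTimes {

    // Returns jobCount fire times at start + 1*spacing, start + 2*spacing, ...
    static List<Date> fireTimes(Date start, int jobCount, long spacingMillis) {
        List<Date> times = new ArrayList<>();
        for (int i = 1; i <= jobCount; i++) {
            times.add(new Date(start.getTime() + i * spacingMillis));
        }
        return times;
    }

    public static void main(String[] args) {
        // Five fire times, one second apart, starting one second from now.
        for (Date d : fireTimes(new Date(), 5, 1000L)) {
            System.out.println(d);
        }
    }
}
```

Terminating the JVM while these triggers are firing gives a reasonable chance of catching one of them in the ACQUIRED state.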

I'll look into opening a JIRA.

Thanks,

nicolas.loriente
@jhouse

You are on the money :) As I was rereading the documentation late last night, I went over the section that basically says never to run in cluster mode on servers whose clocks are not synchronized, and then I saw what the issue was.

Even though I knew that, and our servers' clocks are synchronized, I didn't have it in mind while doing some testing. This happened when some developers' machines were deliberately connected, with persistent schedulers, to our dev Quartz cluster.
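
To see why unsynchronized clocks matter in cluster mode, here is a simplified model of a check-in based failure test. It is a sketch under stated assumptions, not Quartz's actual code: it assumes a node is treated as failed when its last check-in timestamp (written with the node's own clock) is older than the check-in interval plus a grace margin, as judged by the observer's clock. The class name, method name, and the margin value are hypothetical.

```java
import java.util.concurrent.TimeUnit;

// Toy model of check-in based failure detection between cluster nodes.
final class ClusterCheckinModel {

    // Assumed grace margin on top of the check-in interval (illustrative value).
    static final long GRACE_MILLIS = 7_500L;

    // peerCheckinMillis comes from the peer's clock,
    // observerNowMillis comes from the observer's clock.
    static boolean looksFailed(long peerCheckinMillis,
                               long checkinIntervalMillis,
                               long observerNowMillis) {
        return peerCheckinMillis + checkinIntervalMillis + GRACE_MILLIS < observerNowMillis;
    }

    public static void main(String[] args) {
        long interval = TimeUnit.SECONDS.toMillis(20);
        long trueNow = 1_000_000L;
        long peerCheckin = trueNow - 5_000L; // the peer checked in 5 seconds ago

        // Observer clock in sync with the peer: the peer looks healthy.
        System.out.println(looksFailed(peerCheckin, interval, trueNow));

        // Observer clock running 60 seconds ahead: the same healthy peer looks failed.
        System.out.println(looksFailed(peerCheckin, interval, trueNow + 60_000L));
    }
}
```

Under this model, a developer machine whose clock runs ahead of the server by more than the interval plus the margin would keep judging a live instance as failed.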

We are trying to reproduce some lock timeout and trigger-stuck-in-ACQUIRED issues, for which I have separate topics in the forum, and that is why we had some developers join the cluster. Our dev environment is configured with only one server and, at the moment, adding another node is not an option.

We can consider this particular topic closed.

Thanks,

nicolas.loriente


Checking QRTZ_SCHEDULER_STATE, we see active instances disappear from the cluster.

In the logs we see: "This scheduler instance (INSTANCE_NAME) is still active but was recovered by another instance in the cluster."

A little later we see the instance in QRTZ_SCHEDULER_STATE again.

This happens often, and the instances are not being shut down or anything like that. They are active the whole time, yet for some reason Quartz believes they are no longer in the cluster and tries to recover them.

My guess is that while recovering the jobs it realizes the instance is still active in the cluster and prints that message to the logs.

Is there any particular reason why this would happen? Could it be network latency? That is, does Quartz expect a reply from the instance during the cluster check and, if it doesn't get one right away, conclude that the instance is down?

This seems to add a lot of overhead from going through instance recovery, and I believe Quartz doesn't much like having jobs recovered by another instance while the original instance is still active.

How can we avoid this? Any ideas are very welcome.

Thanks,

nicolas.loriente
@jhouse thanks for your reply.

1. We restarted about 15 minutes after shutdown.
2. Cluster check interval is 20 seconds.
3. Instance Id AUTO.
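
As a rough sketch, the settings listed above correspond to the following standard Quartz configuration keys, shown here as a java.util.Properties object. The class and method names are hypothetical, and in a Spring setup these values would more likely live in the scheduler factory bean's Quartz properties or in a quartz.properties file. The isClustered entry is an assumption inferred from the cluster discussion; the interval is given in milliseconds.

```java
import java.util.Properties;

// Hypothetical holder for the cluster-related settings mentioned in the post.
final class ClusterSettingsSketch {

    static Properties clusterProperties() {
        Properties p = new Properties();
        // Instance ID generated automatically per node.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Assumed: clustering enabled on the JDBC job store.
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // Cluster check interval of 20 seconds, expressed in milliseconds.
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
        return p;
    }

    public static void main(String[] args) {
        clusterProperties().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```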

nicolas.loriente
@jhouse thanks for your reply.

Our datasource is created by Tomcat and looked up through Spring JNDI. We are using Spring's LocalDataSourceJobStore.

I have a little more info about the issue. The problem seems to be that the application server was shut down right when Quartz was holding a lock ( SELECT * FROM SCHEMA.QRTZ_LOCKS WHERE LOCK_NAME = :1 FOR UPDATE ).

When the application is restarted it just hangs, because Quartz apparently tries to acquire the same lock that the old orphaned Oracle session is still holding. Not only that, but any request from other nodes in the cluster also waits in line forever for the lock (held by the orphaned Oracle session) to be released.
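
The blocking behaviour described above can be mimicked in plain Java. This is only an analogy, not database code: a daemon thread stands in for the orphaned Oracle session and never releases its lock, while a second caller stands in for the restarted application. All names are hypothetical. The sketch uses a timed lock attempt so that the demo terminates; a plain blocking acquire would wait forever, which is the hang described in the post.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// In-memory analogy for a row lock held by an orphaned session.
final class OrphanedLockModel {

    // Returns a lock that is already held by a thread which never releases it.
    static ReentrantLock lockHeldByOrphan() {
        ReentrantLock lock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);
        Thread orphan = new Thread(() -> {
            lock.lock();          // acquired and never released
            held.countDown();
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
                // the orphan simply ends; the demo never interrupts it
            }
        });
        orphan.setDaemon(true);   // do not keep the JVM alive
        orphan.start();
        try {
            held.await();         // wait until the orphan really holds the lock
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return lock;
    }

    // Tries to acquire the lock, giving up after the timeout.
    static boolean acquireWithTimeout(ReentrantLock lock, long timeoutMillis) {
        try {
            return lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ReentrantLock lock = lockHeldByOrphan();
        System.out.println("acquired: " + acquireWithTimeout(lock, 200L));
    }
}
```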

This seems like an issue that will occur very often as applications/nodes are shut down or restarted (especially in dev/QA environments), given that Quartz does its cluster check every 20 seconds. Indeed, it happens almost every single time we deploy to our dev environment.

How can we avoid these orphaned blocking sessions holding the Quartz lock when shutting down the app, or even on node failure?


I appreciate any advice.

Thanks,

nicolas.loriente
We have several times seen the scheduler get stuck at startup waiting for a DB connection. In those cases we found there was already a blocking Oracle session, so Quartz would sit there waiting and waiting. After the DBA killed the blocking session, the app and the Quartz scheduler started just fine.

We are running quartz-1.8.4 against Oracle 10g. Has anyone experienced a similar situation where Quartz gets stuck at startup? How did you solve it?

What would be the reason for a lingering/already existing Quartz DB session?

Please let me know if I can provide more details to help identify/diagnose the issue.

Thanks,

nicolas.loriente
We had a similar issue where one of our cron triggers got stuck in the ACQUIRED state in a clustered (2-node) environment and did not recover from it.

Has anyone found out what the cause/solution was?

We were able to replicate the issue with a simple JUnit test in which we schedule about 5 jobs roughly 1 second apart and shut down the JVM by manually terminating the test. The JVM has to be shut down right at the moment when the trigger is in the ACQUIRED state.
When the scheduler is brought back up, it does not recover those ACQUIRED triggers (they just stay stuck).
We also noticed that when running with isClustered=false, Quartz does eventually recover from the ACQUIRED state, setting the triggers back to WAITING.
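
The difference observed above can be written down as a toy model. This is not Quartz's recovery code; it only encodes the behaviour reported in this post (ACQUIRED triggers reset to WAITING on startup when not clustered, left untouched when clustered). The class, enum, and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the startup behaviour reported in the post, not Quartz internals.
final class TriggerRecoveryModel {

    enum State { WAITING, ACQUIRED, COMPLETE }

    // Returns the trigger states after a restart, per the reported observation:
    // non-clustered mode resets ACQUIRED to WAITING, clustered mode leaves it stuck.
    static Map<String, State> recoverOnStartup(Map<String, State> triggers, boolean clustered) {
        Map<String, State> result = new LinkedHashMap<>(triggers);
        if (!clustered) {
            result.replaceAll((name, state) ->
                    state == State.ACQUIRED ? State.WAITING : state);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, State> afterCrash = new LinkedHashMap<>();
        afterCrash.put("trigger1", State.ACQUIRED); // JVM was killed at this point
        afterCrash.put("trigger2", State.WAITING);

        System.out.println("isClustered=false: " + recoverOnStartup(afterCrash, false));
        System.out.println("isClustered=true:  " + recoverOnStartup(afterCrash, true));
    }
}
```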

I'm using Quartz with Spring, and we have waitForJobsToCompleteOnShutdown=true.

Using Quartz 1.8.4.

We are currently working on it and debugging through it. I will post our findings if we make any progress.
Any comments on this will be very much appreciated. Thanks!


nicolas.loriente
 