Quartz trigger stuck in ACQUIRED state
huub

neo

Joined: 01/22/2010 11:15:19
Messages: 9
Offline

Recently we had an OutOfMemory error on one of our test servers.
Somehow a Quartz worker thread executing a periodic job got stuck.
After the server was restarted, the periodic (10s) job did not execute.

The state of the trigger at that time was 'ACQUIRED'.
Updating the trigger state to 'WAITING' from a SQL client started the job firing again.
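
The manual fix described above can be sketched as follows. This is only an illustration: Python's built-in sqlite3 module stands in for the real Quartz database, the table and column names (QRTZ_TRIGGERS, TRIGGER_NAME, TRIGGER_GROUP, TRIGGER_STATE) assume the default QRTZ_ prefix and the Quartz 1.x schema, and the trigger name periodicTrigger is hypothetical.

```python
import sqlite3

# Stand-in for the Quartz database; only the columns needed here are modeled.
# Table and column names are assumptions (default QRTZ_ prefix, Quartz 1.x).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE QRTZ_TRIGGERS ("
    " TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT, TRIGGER_STATE TEXT,"
    " PRIMARY KEY (TRIGGER_NAME, TRIGGER_GROUP))"
)
# Hypothetical trigger left behind in ACQUIRED after the crash.
conn.execute(
    "INSERT INTO QRTZ_TRIGGERS VALUES ('periodicTrigger', 'DEFAULT', 'ACQUIRED')"
)
conn.commit()


def reset_stuck_trigger(conn, name, group):
    """Move one trigger from ACQUIRED back to WAITING; return rows changed."""
    cur = conn.execute(
        "UPDATE QRTZ_TRIGGERS SET TRIGGER_STATE = 'WAITING'"
        " WHERE TRIGGER_NAME = ? AND TRIGGER_GROUP = ?"
        " AND TRIGGER_STATE = 'ACQUIRED'",
        (name, group),
    )
    conn.commit()
    return cur.rowcount


changed = reset_stuck_trigger(conn, "periodicTrigger", "DEFAULT")
state = conn.execute("SELECT TRIGGER_STATE FROM QRTZ_TRIGGERS").fetchone()[0]
print(changed, state)
```

The extra `TRIGGER_STATE = 'ACQUIRED'` condition keeps the statement from touching a trigger that has already moved on to another state.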

To reproduce this:
- Update the trigger state of a periodic trigger to 'ACQUIRED' from a SQL client while the server is down.
- Start server

This is the issue from http://jira.opensymphony.com/browse/QUARTZ-661
It is wrongly stated in that issue that there is code to cover this case.
I have looked at the code and cannot find it.

This is a major issue as it is unacceptable that Quartz does not recover from a server failure.
jhouse

seraphim
[Avatar]
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline


Very odd.

It certainly IS in the codebase to recover this scenario.

JobStoreSupport.schedulerStarted().recoverJobs().recoverJobs().getDelegate().updateTriggerStatesFromOtherStates(..)

-- See line #819 in the current code of JobStoreSupport (revision 1028).


Can you post your Quartz properties file?
huub


James,

As far as I can see, the code you are referring to is only called in a non-clustered environment.
In a clustered environment the method clusterRecover is called instead.

Greetings,
Huub
Attachment: pil_quartz_persistent.properties (2 Kbytes)

jhouse



That's right, that's the primary reason I wanted to see your Quartz .properties file - to see if you were clustered or not.

I'll now be able to look further into it.
jhouse



One more question: Are you left with a record of "acquired" state for that trigger in the fired_triggers table as well?
jhouse



BTW: There IS code for releasing acquired triggers in clustered mode - see JobStoreSupport.clusterRecover() -- approx line # 3390.

.. Not saying it is bug free, just saying it's there.
jhouse



huub wrote:
  To reproduce this:
  - Updating the trigger state of a periodic trigger to 'ACQUIRED' from a sql client when the server is down.
  - Start server
 


That will not reproduce the problem.

It WILL cause the trigger to be stuck (if isClustered=true). But it does not reproduce the problem, in that the real scheduler will have also made an entry in the fired_triggers table, noting the state ACQUIRED and the scheduler instance Id.

Because you did not do this, no failed node is detected, and hence the recover code does not run.
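
To make the distinction concrete, here is a deliberately simplified model of the behaviour described above. It is not the actual JobStoreSupport.clusterRecover() code, only a sketch of the idea that recovery is driven by the fired_triggers rows of a failed instance. The schema (QRTZ_TRIGGERS, QRTZ_FIRED_TRIGGERS and their columns) assumes the default QRTZ_ prefix and the Quartz 1.x layout; the trigger names, the instance id node1, and the function recover_failed_instance are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed versions of the Quartz tables (default QRTZ_ prefix).
conn.executescript("""
CREATE TABLE QRTZ_TRIGGERS (
    TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT, TRIGGER_STATE TEXT);
CREATE TABLE QRTZ_FIRED_TRIGGERS (
    ENTRY_ID TEXT, TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT,
    INSTANCE_NAME TEXT, STATE TEXT);

-- Trigger acquired by a real scheduler that then crashed:
-- it has a matching fired_triggers row naming the instance.
INSERT INTO QRTZ_TRIGGERS VALUES ('crashedTrigger', 'DEFAULT', 'ACQUIRED');
INSERT INTO QRTZ_FIRED_TRIGGERS
    VALUES ('e1', 'crashedTrigger', 'DEFAULT', 'node1', 'ACQUIRED');

-- Trigger set to ACQUIRED by hand from a SQL client: no fired_triggers row.
INSERT INTO QRTZ_TRIGGERS VALUES ('handEdited', 'DEFAULT', 'ACQUIRED');
""")


def recover_failed_instance(conn, failed_instance):
    """Release triggers that the failed instance had acquired (simplified)."""
    acquired = conn.execute(
        "SELECT TRIGGER_NAME, TRIGGER_GROUP FROM QRTZ_FIRED_TRIGGERS"
        " WHERE INSTANCE_NAME = ? AND STATE = 'ACQUIRED'",
        (failed_instance,),
    ).fetchall()
    for name, group in acquired:
        conn.execute(
            "UPDATE QRTZ_TRIGGERS SET TRIGGER_STATE = 'WAITING'"
            " WHERE TRIGGER_NAME = ? AND TRIGGER_GROUP = ?"
            " AND TRIGGER_STATE = 'ACQUIRED'",
            (name, group),
        )
    # The failed instance's bookkeeping rows are no longer needed.
    conn.execute(
        "DELETE FROM QRTZ_FIRED_TRIGGERS WHERE INSTANCE_NAME = ?",
        (failed_instance,),
    )
    conn.commit()
    return len(acquired)


released = recover_failed_instance(conn, "node1")
states = dict(
    conn.execute("SELECT TRIGGER_NAME, TRIGGER_STATE FROM QRTZ_TRIGGERS").fetchall()
)
print(released, states)
```

In this model the hand-edited trigger stays in ACQUIRED because nothing in fired_triggers ties it to a failed instance, which is the situation the manual reproduction creates.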
huub


James,

I was alerted to the problem some time after the server had been restarted.
At that point I did not see any records in the 'fired_triggers' table.
That's why I reproduced the situation as I did.
However I do not know what was in the 'fired_triggers' table just after the crash.

I understand your point: a trigger is set to 'ACQUIRED' in the same transaction in which an entry is made for it in
the 'fired_triggers' table. The record in 'fired_triggers' is necessary for clusterRecover to reset the trigger
to 'WAITING'.
Indeed, I have made some efforts to reproduce the case by letting the server crash at various points, but I haven't had
any success yet.

The following facts remain:
- The server experienced an OOM.
- After the server had been restarted there was a trigger stuck in state 'ACQUIRED'. Its next fire time was earlier than the OOM, even though
it is a periodic (10s) trigger.


jhouse wrote:
  It WILL cause the trigger to be stuck (if isClustered=true). But it does not reproduce the problem, in that the real scheduler will have also made an entry in the fired_triggers table, noting the state ACQUIRED and the scheduler instance Id.

  Because you did not do this, no failed node is detected, and hence the recover code does not run.
 

So a trigger in state 'ACQUIRED' without a matching 'fired_triggers' record at startup is an error situation.
Quartz, however, starts up silently. Maybe it should detect inconsistencies in its tables, report an error, and write a trace dump?
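
A check of that kind could look roughly like the query below. This is a sketch, not something Quartz does: sqlite3 stands in for the real database, the schema assumes the default QRTZ_ prefix and Quartz 1.x column names, and the trigger names and the function find_orphaned_acquired are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced, assumed versions of the Quartz tables (default QRTZ_ prefix).
conn.executescript("""
CREATE TABLE QRTZ_TRIGGERS (
    TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT, TRIGGER_STATE TEXT);
CREATE TABLE QRTZ_FIRED_TRIGGERS (
    ENTRY_ID TEXT, TRIGGER_NAME TEXT, TRIGGER_GROUP TEXT,
    INSTANCE_NAME TEXT, STATE TEXT);

INSERT INTO QRTZ_TRIGGERS VALUES ('consistent', 'DEFAULT', 'ACQUIRED');
INSERT INTO QRTZ_FIRED_TRIGGERS
    VALUES ('e1', 'consistent', 'DEFAULT', 'node1', 'ACQUIRED');
INSERT INTO QRTZ_TRIGGERS VALUES ('orphaned', 'DEFAULT', 'ACQUIRED');
INSERT INTO QRTZ_TRIGGERS VALUES ('idle', 'DEFAULT', 'WAITING');
""")


def find_orphaned_acquired(conn):
    """Return (name, group) of ACQUIRED triggers with no fired_triggers row."""
    return conn.execute(
        "SELECT t.TRIGGER_NAME, t.TRIGGER_GROUP"
        " FROM QRTZ_TRIGGERS t"
        " LEFT JOIN QRTZ_FIRED_TRIGGERS f"
        "   ON f.TRIGGER_NAME = t.TRIGGER_NAME"
        "  AND f.TRIGGER_GROUP = t.TRIGGER_GROUP"
        " WHERE t.TRIGGER_STATE = 'ACQUIRED' AND f.ENTRY_ID IS NULL"
        " ORDER BY t.TRIGGER_GROUP, t.TRIGGER_NAME"
    ).fetchall()


orphans = find_orphaned_acquired(conn)
for name, group in orphans:
    print(f"inconsistent: {group}.{name} is ACQUIRED without a fired_triggers row")
```

On a live cluster such a query can produce false positives if it runs outside the scheduler's own locking, so it is best treated as a diagnostic run while the schedulers are stopped.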

Greetings,
Huub
jhouse



Understood.

As you and others have reported this situation occurring, I'm inclined to believe it is possible - but in reviewing the code and attempting my own "crashes" to reproduce the problem I have not yet found anything amiss. :(
loriente

journeyman

Joined: 05/10/2011 12:10:29
Messages: 37
Offline

We had a similar issue where one of our cron triggers got stuck in the ACQUIRED state in a clustered (2 nodes) environment and did not recover from that state.

Has anyone found what the issue/solution was?

We were able to replicate the issue with a simple JUnit test in which we schedule about 5 jobs roughly 1 second apart and shut down the JVM by manually terminating the test. The JVM has to be shut down right at the moment when a trigger is in the ACQUIRED state.
When the scheduler is brought back up, it does not recover those ACQUIRED triggers (they just stay stuck).
We also noticed that when run with isClustered=false, Quartz does recover from the ACQUIRED state, eventually setting the triggers back to WAITING.

I'm using Quartz with Spring, and we have waitForJobsToCompleteOnShutdown=true.

Using Quartz 1.8.4.

We are currently working on it and debugging through the code. I will post our findings if we make any progress on this.
Any comments on this will be very much appreciated. Thanks!


nicolas.loriente
jhouse



How quickly after killing the process did you restart it? And what is your cluster check-in interval, and how are you setting your instanceIds?
loriente


@jhouse thanks for your reply.

1. We restarted about 15 minutes after shutdown.
2. Cluster check interval is 20 seconds.
3. Instance Id AUTO.

nicolas.loriente
jhouse


I'm not able to reproduce this, but a few have reported it over the last year, so it is probably worth spending some more time on as I'm inclined to believe there must be a special timing situation that can cause this.

Could you please file an issue in Jira?
loriente


@jhouse

You can reproduce this by cranking up the number of jobs you schedule, giving them closely spaced fire times (e.g. 1, 2, 3, ... seconds), and manually shutting down the JVM by terminating the test. There is a good chance the shutdown happens while a trigger is in the ACQUIRED state (you might need to try a couple of times, but it should be easy to hit).

Then restart the scheduler and observe whether it recovers the trigger sitting in the ACQUIRED state.

I'll look into opening a jira.

Thanks,

nicolas.loriente
sparsons

neo

Joined: 05/17/2011 02:18:43
Messages: 1
Offline

We've seen this issue periodically in our live system, as we schedule a good number of one-off jobs for background processing. We have a cluster of 2 nodes with the JobStoreTX job store managed by Spring's SchedulerFactoryBean (dontSetAutoCommitFalse = false).

Just spent most of a day looking through the code around this, and it does look like calling that recover method might fix it, but I'm very wary of doing that while everything is still running.

This happens during normal operation as well, without shutting down the application servers.
 