| Author |
Message |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/01/2010 14:55:09
|
huub
neo
Joined: 01/22/2010 11:15:19
Messages: 9
Offline
|
Recently we had an OutOfMemory error on one of our test servers.
Somehow a Quartz worker thread executing a periodic job got stuck.
After restart of the server the periodic (10s) job did not execute.
The state of the trigger at that time was 'ACQUIRED'.
Updating the trigger state to 'WAITING' from a sql client started the job firing again.
To reproduce this:
- Update the trigger state of a periodic trigger to 'ACQUIRED' from a sql client while the server is down.
- Start server
This is the issue from http://jira.opensymphony.com/browse/QUARTZ-661
It is wrongly stated in that issue that there is code to cover this case.
I have looked at the code and cannot find it.
This is a major issue as it is unacceptable that Quartz does not recover from a server failure.
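The manual workaround described above (setting the trigger state back to 'WAITING' from a sql client) could be scripted roughly as follows. This is only a sketch: the class and method names are hypothetical, and the table and column names assume the default Quartz 1.x JDBC job store schema with the `QRTZ_` prefix. It is meant to be run while no scheduler is using the database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical helper, not part of Quartz. */
final class StuckTriggerWorkaround {

    // Assumes the default QRTZ_ table prefix; adjust if a different prefix is configured.
    static final String RELEASE_SQL =
        "UPDATE QRTZ_TRIGGERS SET TRIGGER_STATE = ? "
      + "WHERE TRIGGER_GROUP = ? AND TRIGGER_NAME = ? AND TRIGGER_STATE = ?";

    /**
     * Moves one trigger from ACQUIRED back to WAITING.
     * Returns the number of rows changed (0 if the trigger was not in ACQUIRED).
     */
    static int release(Connection conn, String group, String name) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(RELEASE_SQL)) {
            ps.setString(1, "WAITING");
            ps.setString(2, group);
            ps.setString(3, name);
            ps.setString(4, "ACQUIRED"); // only touch triggers that are actually stuck
            return ps.executeUpdate();
        }
    }
}
```

The extra `TRIGGER_STATE = 'ACQUIRED'` condition keeps the update from overwriting a trigger that has moved on to another state in the meantime.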
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/01/2010 20:19:20
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
Very odd.
It certainly IS in the codebase to recover this scenario.
JobStoreSupport.schedulerStarted() -> recoverJobs() -> recoverJobs(..) -> getDelegate().updateTriggerStatesFromOtherStates(..)
-- See line #819 in the current code of JobStoreSupport (revision 1028).
Can you post your Quartz properties file?
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 01:08:27
|
huub
neo
Joined: 01/22/2010 11:15:19
Messages: 9
Offline
|
James,
As far as I can see, the code you are referring to is only called in a non-clustered environment.
In a clustered environment the method clusterRecover is called.
Greetings,
Huub
Attachment: pil_quartz_persistent.properties (2 Kbytes)
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 07:42:03
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
That's right, that's the primary reason I wanted to see your Quartz .properties file - to see if you were clustered or not.
I'll now be able to look further into it.
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 07:52:43
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
One more question: Are you left with a record of "acquired" state for that trigger in the fired_triggers table as well?
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 08:10:26
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
BTW: There IS code for releasing acquired triggers in clustered mode - see JobStoreSupport.clusterRecover() -- approx line # 3390.
.. Not saying it is bug free, just saying it's there.
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 08:16:57
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
huub wrote:
> To reproduce this:
> - Updating the trigger state of a periodic trigger to 'ACQUIRED' from a sql client when the server is down.
> - Start server
That will not reproduce the problem.
It WILL cause the trigger to be stuck (if isClustered=true). But it does not reproduce the problem, in that the real scheduler will have also made an entry in the fired_triggers table, noting the state ACQUIRED and the scheduler instance Id.
Because you did not do this, no failed node is detected, and hence the recover code does not run.
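To make the point above concrete, here is a toy in-memory model of the behaviour described: recovery is driven by rows in fired_triggers that belong to a failed instance, so a trigger that is ACQUIRED without such a row is never released. This is not Quartz code; every name in it is hypothetical, and it deliberately ignores everything else that the real clusterRecover() does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only; NOT the Quartz implementation. */
final class ClusterRecoveryModel {

    /** One row of a simplified fired_triggers table. */
    static final class FiredTrigger {
        final String triggerKey;
        final String instanceId;
        final String state;

        FiredTrigger(String triggerKey, String instanceId, String state) {
            this.triggerKey = triggerKey;
            this.instanceId = instanceId;
            this.state = state;
        }
    }

    /**
     * Resets ACQUIRED triggers to WAITING, but only when a fired_triggers row
     * owned by a failed instance points at them. Returns the released keys.
     */
    static List<String> recover(Map<String, String> triggerStates,
                                List<FiredTrigger> firedTriggers,
                                Set<String> failedInstanceIds) {
        List<String> released = new ArrayList<>();
        for (FiredTrigger ft : firedTriggers) {
            boolean ownerFailed = failedInstanceIds.contains(ft.instanceId);
            if (ownerFailed
                    && "ACQUIRED".equals(ft.state)
                    && "ACQUIRED".equals(triggerStates.get(ft.triggerKey))) {
                triggerStates.put(ft.triggerKey, "WAITING");
                released.add(ft.triggerKey);
            }
        }
        return released;
    }
}
```

In this model, a trigger set to ACQUIRED by hand has no fired_triggers row, so the loop never visits it and it stays ACQUIRED, which matches the stuck trigger seen with the manual reproduction.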
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 14:52:35
|
huub
neo
Joined: 01/22/2010 11:15:19
Messages: 9
Offline
|
James,
I was alerted to the problem some time after the server had been restarted.
At that point I did not see any records in the 'fired_triggers' table.
That's why I reproduced the situation as I did.
However I do not know what was in the 'fired_triggers' table just after the crash.
I understand your point: a trigger is set to 'ACQUIRED' in the same transaction in which an entry is made for it in the 'fired_triggers' table. The record in 'fired_triggers' is necessary for clusterRecover to reset the trigger to 'WAITING'.
I have indeed made some efforts to reproduce the case by letting the server crash at various points, but I haven't had any success yet.
The following facts remain:
- The server experienced an OOM.
- After the server had been restarted there was a trigger stuck in state 'ACQUIRED'. Its next-fire time was earlier than the OOM, even though it was a periodic (10s) trigger.
jhouse wrote:
> It WILL cause the trigger to be stuck (if isClustered=true). But it does not reproduce the problem, in that the real scheduler will have also made an entry in the fired_triggers table, noting the state ACQUIRED and the scheduler instance Id.
> Because you did not do this, no failed node is detected, and hence the recover code does not run.
So a trigger in state 'ACQUIRED' without a fired_trigger at startup is an error situation.
Quartz, however, will silently start up. Maybe it should detect inconsistencies in its tables, report an error and make a trace dump?
Greetings,
Huub
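A check of the kind suggested above could look something like the sketch below: list triggers that are in state 'ACQUIRED' but have no matching row in fired_triggers. The class and method names are hypothetical, and the table and column names assume the default Quartz 1.x schema with the `QRTZ_` prefix. Quartz itself does not provide this check as far as this thread establishes.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic, not part of Quartz. */
final class OrphanedAcquiredTriggerCheck {

    // Assumes the default QRTZ_ table prefix.
    static final String ORPHAN_QUERY =
        "SELECT t.TRIGGER_GROUP, t.TRIGGER_NAME FROM QRTZ_TRIGGERS t "
      + "WHERE t.TRIGGER_STATE = ? AND NOT EXISTS ("
      + "SELECT 1 FROM QRTZ_FIRED_TRIGGERS f "
      + "WHERE f.TRIGGER_GROUP = t.TRIGGER_GROUP "
      + "AND f.TRIGGER_NAME = t.TRIGGER_NAME)";

    /** Returns "group.name" for every ACQUIRED trigger without a fired_triggers row. */
    static List<String> findOrphans(Connection conn) throws SQLException {
        List<String> orphans = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(ORPHAN_QUERY)) {
            ps.setString(1, "ACQUIRED");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    orphans.add(rs.getString(1) + "." + rs.getString(2));
                }
            }
        }
        return orphans;
    }
}
```

The result is most meaningful when no scheduler is running against the database; on a live cluster it is best treated as a list of candidates to inspect.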
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 02/02/2010 18:13:12
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
Understood.
As you and others have reported this situation occurring, I'm inclined to believe it is possible - but in reviewing the code and attempting my own "crashes" to reproduce the problem I have not yet found anything amiss. :(
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/10/2011 12:48:39
|
loriente
journeyman
Joined: 05/10/2011 12:10:29
Messages: 37
Offline
|
We had a similar issue where one of our cron triggers got stuck in the ACQUIRED state in a clustered (2-node) environment and did not recover from that state.
Has anyone found what the issue/solution was?
We were able to replicate the issue with a simple JUnit test in which we schedule about 5 jobs roughly 1 second apart and shut down the JVM by manually terminating the test. The JVM has to be shut down right at the moment when a trigger is in the ACQUIRED state.
When the scheduler is brought back up, it does not recover those ACQUIRED triggers (they just stay stuck).
We also noticed that when run with isClustered=false, Quartz does recover from the ACQUIRED state, eventually setting the triggers back to WAITING.
I'm using Quartz with Spring, and we have waitForJobsToCompleteOnShutdown=true.
Using Quartz 1.8.4.
We are currently working on it and debugging through the code. I will post our findings if we make any progress on this.
Any comments on this will be very much appreciated. Thanks!
nicolas.loriente
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/10/2011 18:58:38
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
How quickly after killing the process did you restart it? What is your cluster check-in interval, and how are you setting your instanceIds?
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/12/2011 08:44:15
|
loriente
journeyman
Joined: 05/10/2011 12:10:29
Messages: 37
Offline
|
@jhouse thanks for your reply.
1. We restarted about 15 minutes after shutdown.
2. Cluster check interval is 20 seconds.
3. Instance Id AUTO.
nicolas.loriente
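For reference, the settings listed above would correspond roughly to the following scheduler properties, shown here built in code. The property keys are the standard Quartz ones; the class name is hypothetical, and the value 20000 assumes the check-in interval is given in milliseconds, as Quartz expects.

```java
import java.util.Properties;

/** Hypothetical holder for the clustering settings mentioned in the post. */
final class ClusteredQuartzSettings {

    static Properties build() {
        Properties p = new Properties();
        // Let Quartz generate a unique instance id per node.
        p.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        // Clustered JDBC job store.
        p.setProperty("org.quartz.jobStore.isClustered", "true");
        // 20 seconds, expressed in milliseconds.
        p.setProperty("org.quartz.jobStore.clusterCheckinInterval", "20000");
        return p;
    }
}
```

These entries are only the clustering-related part; a working configuration also needs the job store class, data source and thread pool settings.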
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/13/2011 02:59:44
|
jhouse
seraphim
Joined: 11/06/2009 15:29:56
Messages: 1654
Offline
|
I'm not able to reproduce this, but a few have reported it over the last year, so it is probably worth spending some more time on as I'm inclined to believe there must be a special timing situation that can cause this.
Could you please file an issue in Jira?
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/13/2011 07:12:34
|
loriente
journeyman
Joined: 05/10/2011 12:10:29
Messages: 37
Offline
|
@jhouse
You can reproduce this by cranking up the number of jobs you schedule, giving them closely spaced fire times (e.g. 1, 2, 3, ... seconds), and manually shutting down the JVM by terminating the test. There is a good chance you will shut down while a trigger is in the ACQUIRED state (you might need to try a couple of times, but you should be able to achieve it easily).
Then restart the scheduler and observe whether it recovers the trigger sitting in the ACQUIRED state.
I'll look into opening a jira.
Thanks,
nicolas.loriente
|
|
|
 |
![[Post New]](/forums/templates/default/images/icon_minipost_new.gif) 05/17/2011 02:27:38
|
sparsons
neo
Joined: 05/17/2011 02:18:43
Messages: 1
Offline
|
We've seen this issue in our live system periodically, as we schedule a good number of one-off jobs for background processing. We have a cluster of 2 nodes with the JobStoreTX job store managed by SchedulerFactoryBean (dontSetAutoCommitFalse = false) from Spring.
I just spent most of a day looking through the code around this, and it does look like calling that recover method might fix it, but I'm highly dubious about doing that while everything is still running.
This happens while running normally as well, without shutting down the application servers.
|
|
|
 |
|
|