Our production system hangs because one Quartz thread holds a lock on the quartz_locks table (the TRIGGER_ACCESS row).
We use Quartz 1.6.5.
I have enclosed thread dumps from all four nodes of our system, taken at the time of the hang.
In the thread dump of node 4 we see the following trace:
"QuartzScheduler_PersistentScheduler-supizas4.pcs.portinfolink.com1264070532061_MisfireHandler" prio=10 tid=0x087f6800 nid=0x43a6 waiting for monitor entry [0x820fe000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.quartz.impl.jdbcjobstore.StdJDBCDelegate.selectMisfiredTriggersInStates(StdJDBCDelegate.java:311)
at org.quartz.impl.jdbcjobstore.JobStoreSupport.recoverMisfiredJobs(JobStoreSupport.java:926)
at org.quartz.impl.jdbcjobstore.JobStoreSupport.doRecoverMisfires(JobStoreSupport.java:3126)
at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.manage(JobStoreSupport.java:3887)
at org.quartz.impl.jdbcjobstore.JobStoreSupport$MisfireHandler.run(JobStoreSupport.java:3907)
The stack shows that this thread already owns the TRIGGER_ACCESS lock (acquired in JobStoreSupport.doRecoverMisfires), yet it then blocks waiting for a Java monitor.
This one lock brings our entire system down.
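To make the locking pattern concrete, here is a minimal self-contained sketch of what we believe the MisfireHandler does, based on our reading of StdRowLockSemaphore and JobStoreSupport in the 1.6.x sources (the class name, JDBC URL and credentials below are placeholders of our own; this is a paraphrase, not actual Quartz code):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Sketch of the row-lock pattern as we understand it; not Quartz source.
    public class TriggerAccessLockSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "quartz", "secret");
            conn.setAutoCommit(false);
            try {
                // Every scheduler node runs something like this before touching
                // trigger state; the second node to arrive blocks inside the
                // database until the first one commits or rolls back.
                PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM quartz_locks WHERE LOCK_NAME = ? FOR UPDATE");
                ps.setString(1, "TRIGGER_ACCESS");
                ResultSet rs = ps.executeQuery();
                rs.next(); // the TRIGGER_ACCESS row is now locked by this session

                // ... misfire recovery runs here while the row lock is held; if
                // this thread stalls on a JVM monitor, the lock is never released
                // and every other node piles up behind it ...

                conn.commit(); // releases the row lock
            } finally {
                conn.close();
            }
        }
    }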
We are in the midst of detailed performance testing of our system, and the system hangs rather frequently.
At that point we see a large number of threads piling up to obtain the lock on the quartz_locks table.
We have taken two thread dumps so far, each just after the system hung. You have seen one; the other
shows exactly the same pattern. Notice that in both cases the dump reports a 'deadlock' on a Java monitor,
but that deadlock seems unrelated to the threads blocking in the Quartz code.
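One extra analysis step we could take: report the monitor deadlock from inside the running server with java.lang.management, to see whether the Quartz threads are actually part of the cycle. A minimal sketch (the class name, and the idea of wiring it into a diagnostic servlet or timer, are ours):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Prints the threads involved in a JVM monitor deadlock, if any.
    // findMonitorDeadlockedThreads() is available since Java 5.
    public class DeadlockProbe {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids == null) {
                System.out.println("No monitor deadlock detected.");
                return;
            }
            for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
                System.out.println(info.getThreadName()
                        + " blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
    }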
Both Quartz threads that block do so at locations where they are simply adding items to a newly instantiated
local list (StdJDBCDelegate.java:2927 and StdJDBCDelegate.java:311).
The closest synchronization I can find is the next() call on an OracleResultsetImpl (our application server is Oracle AS).
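For reference, a minimal reconstruction of the pattern we see at those two lines (our own paraphrase, not the actual StdJDBCDelegate source):

    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.LinkedList;
    import java.util.List;

    // Paraphrase of the blocked code path: the loop only appends to a freshly
    // created local list, so nothing of ours synchronizes here. The only
    // synchronized call in sight is rs.next(), which enters the Oracle driver.
    public class MisfiredTriggerScan {
        static List readTriggerKeys(ResultSet rs) throws SQLException {
            List keys = new LinkedList(); // local, unshared list
            while (rs.next()) { // driver call; could block on a driver monitor
                keys.add(rs.getString("TRIGGER_NAME") + "."
                        + rs.getString("TRIGGER_GROUP"));
            }
            return keys;
        }
    }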
Our problem is that our tests take considerable time and human resources to set up and execute.
Furthermore, we are set to go to production with the version under test next week.
Tests now frequently fail due to the locked 'TRIGGER_ACCESS' row of the quartz_locks table.
Do you see a temporary workaround, or additional things we could do to analyze this problem?
Our only option at the moment is to bring down the server from which the hanging database session originates.
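A less drastic variant we are considering (a sketch; it assumes Oracle 10g or later, where v$session exposes BLOCKING_SESSION, and an account privileged to read it): query the database for the session that holds the row, so a DBA can kill just that session with ALTER SYSTEM KILL SESSION instead of bouncing the whole server.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Lists sessions that are blocked and the session blocking them,
    // so the offending session can be killed individually by a DBA.
    public class FindBlockingSession {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:oracle:thin:@//dbhost:1521/ORCL", "dba_user", "secret");
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery(
                    "SELECT sid, serial#, blocking_session, seconds_in_wait "
                  + "FROM v$session WHERE blocking_session IS NOT NULL");
            while (rs.next()) {
                System.out.println("session " + rs.getLong("sid")
                        + "," + rs.getLong("serial#")
                        + " waits on session " + rs.getLong("blocking_session")
                        + " for " + rs.getLong("seconds_in_wait") + "s");
            }
            conn.close();
        }
    }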
We have not tried going back to a previous version of the JVM; there are too many competing priorities to try this in the short term.
As indicated, we suspected a correlation with the deadlocks reported in the thread dump.
Instrumentation classes of the profiler were involved in the deadlock trace, so we tried a number of runs without the instrumentation.
We had no 'hanging' Quartz locks in those runs.
However, some performance-related concerns have arisen from those tests; I will post a new message in this forum to address them.
Hope you will respond.