A Quartz persistent scheduler uses database locks to prevent deadlocks while updating its Job and Trigger tables.
As a consequence, it forces a multithreaded application to perform part of its work sequentially.
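As far as we understand it, the mechanism boils down to the sketch below (a simplification, not the actual StdRowLockSemaphore source): select the lock row FOR UPDATE, do the scheduler work, and let the commit release the row lock. Every thread on every node that needs the TRIGGER_ACCESS lock queues up on that single row.
Code:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TriggerAccessLockSketch {
    // Simplified sketch of the pattern; the real logic lives in
    // org.quartz.impl.jdbcjobstore.StdRowLockSemaphore.
    static void executeUnderLock(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM QRTZ_LOCKS WHERE LOCK_NAME = ? FOR UPDATE");
        try {
            ps.setString(1, "TRIGGER_ACCESS");
            ResultSet rs = ps.executeQuery(); // blocks while another transaction holds the row lock
            rs.next();
            // ... work on QRTZ_TRIGGERS etc. happens here, serialized
            // across all threads and all cluster nodes ...
        } finally {
            ps.close();
        }
        conn.commit(); // the commit releases the row lock to the next waiter
    }
}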
Our application runs on 4 nodes, each node running a number of threads in which Quartz triggers are scheduled and fired and Jobs are executed.
Each node has a PersistentScheduler (clustered) and a MemoryScheduler.
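Roughly, each node sets up its two schedulers as in the sketch below (illustrative only: the thread counts, data source name and connection details are placeholders, not our exact configuration):
Code:
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class SchedulerSetupSketch {
    public static void main(String[] args) throws Exception {
        // Clustered, database-backed scheduler (one instance per node).
        Properties persistent = new Properties();
        persistent.setProperty("org.quartz.scheduler.instanceName", "PersistentScheduler");
        persistent.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        persistent.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        persistent.setProperty("org.quartz.threadPool.threadCount", "5");
        persistent.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        persistent.setProperty("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        persistent.setProperty("org.quartz.jobStore.isClustered", "true");
        persistent.setProperty("org.quartz.jobStore.dataSource", "myDS");
        // Placeholder connection details:
        persistent.setProperty("org.quartz.dataSource.myDS.driver", "oracle.jdbc.OracleDriver");
        persistent.setProperty("org.quartz.dataSource.myDS.URL", "jdbc:oracle:thin:@dbhost:1521:ORCL");
        persistent.setProperty("org.quartz.dataSource.myDS.user", "quartz");
        persistent.setProperty("org.quartz.dataSource.myDS.password", "quartz");
        Scheduler persistentScheduler = new StdSchedulerFactory(persistent).getScheduler();

        // In-memory scheduler: no database, so no TRIGGER_ACCESS semaphore.
        Properties memory = new Properties();
        memory.setProperty("org.quartz.scheduler.instanceName", "MemoryScheduler");
        memory.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        memory.setProperty("org.quartz.threadPool.threadCount", "5");
        memory.setProperty("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");
        Scheduler memoryScheduler = new StdSchedulerFactory(memory).getScheduler();

        persistentScheduler.start();
        memoryScheduler.start();
    }
}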
During performance testing of our system the following situation was observed.
At a certain point the system came under heavy stress; garbage collection overhead was high (80%).
In the database we observed that the Quartz SQL statement
"SELECT * FROM QRTZ_LOCKS WHERE LOCK_NAME = 'TRIGGER_ACCESS' FOR UPDATE"
was by far the top database resource consumer.
Analyzing the log files showed the following (excerpts filtered on Quartz logging):
Code:
2010-01-26 17:00:32,325 [HTTPThreadGroup-69] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-69
...
2010-01-26 17:00:58,995 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-66
2010-01-26 17:00:59,070 [HTTPThreadGroup-53] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-53
2010-01-26 17:00:59,070 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: HTTPThreadGroup-66
2010-01-26 17:00:59,119 [HTTPThreadGroup-53] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: HTTPThreadGroup-53
2010-01-26 17:01:09,006 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-66
2010-01-26 17:01:09,135 [HTTPThreadGroup-55] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-55
2010-01-26 17:01:15,746 [HTTPThreadGroup-69] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-69
HTTPThreadGroup-69 is the first thread to get the lock after it is released by HTTPThreadGroup-53.
About 43 seconds pass from the moment HTTPThreadGroup-69 starts obtaining the lock (17:00:32) until the lock is given to it (17:01:15).
Moreover, about 16 seconds pass between HTTPThreadGroup-53 returning the lock (17:00:59) and HTTPThreadGroup-69 actually receiving it, apparently because HTTPThreadGroup-69 does not get the CPU time to take it.
Remember that the system is under stress. It looks as if Quartz does not get enough time to fire its triggers and starts running behind.
Other logging shows that the MisfireHandler is holding the lock for long periods:
Code:
2010-01-26 17:12:22,083 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:12:42,591 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:13:09,425 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:13:26,502 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:06,360 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:19,681 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:52,299 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:15:09,132 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:15:48,515 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:16:08,786 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
By holding the lock for such long periods (13-20 seconds at a time in the excerpt above) the MisfireHandler blocks other Quartz threads from firing triggers, which in turn produces more misfires for it to handle.
It looks like a vicious circle: when the system is under stress Quartz suffers, and in suffering it aggravates the situation.
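For completeness: the only tuning knobs we have found so far are the misfire threshold (how late a trigger must be before it counts as misfired, which as far as we can tell also influences how often the MisfireHandler scans) and, in recent Quartz versions, a cap on the misfire batch size. Whether these actually break the circle is part of our question; the property names below are from the JDBC job store configuration reference, the values are arbitrary:
Code:
import java.util.Properties;

public class MisfireTuningSketch {
    static Properties misfireProperties() {
        Properties p = new Properties();
        // Only consider a trigger misfired once it is this many ms late
        // (default 60000). Raising it should make the MisfireHandler run,
        // and grab TRIGGER_ACCESS, less often.
        p.setProperty("org.quartz.jobStore.misfireThreshold", "120000");
        // Recent Quartz versions can cap how many misfired triggers are
        // processed per acquisition of the lock (default 20), so the lock
        // is held for shorter stretches.
        p.setProperty("org.quartz.jobStore.maxMisfiresToHandleAtATime", "5");
        return p;
    }
}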
Questions:
- Is this a fair analysis, and is there anything we can do about it?
- When and how will Quartz get rid of these performance-unfriendly database semaphores? Can we avoid the locks in certain environments? Even moving the semaphore to a Terracotta grid only seems to shift the problem rather than remove it.
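For example, would something like the sketch below be legitimate for environments where only a single node touches the tables? (Untested on our side; org.quartz.jobStore.lockHandler.class is a documented property, and SimpleSemaphore is the JVM-local lock handler that ships in the same package as StdRowLockSemaphore. In a clustered setup this would obviously be unsafe.)
Code:
import java.util.Properties;

public class LocalLockHandlerSketch {
    static Properties nonClusteredProperties() {
        Properties p = new Properties();
        p.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        // Replace the SELECT ... FOR UPDATE semaphore with a JVM-local one;
        // only conceivable when clustering is off and one node owns the tables.
        p.setProperty("org.quartz.jobStore.isClustered", "false");
        p.setProperty("org.quartz.jobStore.lockHandler.class", "org.quartz.impl.jdbcjobstore.SimpleSemaphore");
        return p;
    }
}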