A Quartz persistent scheduler uses database locks to prevent deadlocks while updating its Job and Trigger tables.
As a consequence, it forces a multithreaded application to perform part of its work sequentially.
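As far as we understand it, the mechanism boils down to the sketch below (a simplification, not the actual StdRowLockSemaphore source): select the lock row FOR UPDATE, do the scheduler work, and let the commit release the row lock. Every thread on every node that needs the TRIGGER_ACCESS lock queues up on that single row.
Code:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class TriggerAccessLockSketch {
    // Simplified sketch of the pattern; the real logic lives in
    // org.quartz.impl.jdbcjobstore.StdRowLockSemaphore.
    static void executeUnderLock(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        PreparedStatement ps = conn.prepareStatement(
                "SELECT * FROM QRTZ_LOCKS WHERE LOCK_NAME = ? FOR UPDATE");
        try {
            ps.setString(1, "TRIGGER_ACCESS");
            ResultSet rs = ps.executeQuery(); // blocks while another transaction holds the row lock
            rs.next();
            // ... work on QRTZ_TRIGGERS etc. happens here, serialized
            // across all threads and all cluster nodes ...
        } finally {
            ps.close();
        }
        conn.commit(); // the commit releases the row lock to the next waiter
    }
}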
Our application runs on 4 nodes, each node running a number of threads in which Quartz triggers are scheduled and fired and Jobs are executed.
Each node has a PersistentScheduler (clustered) and a MemoryScheduler.
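Roughly, each node sets up its two schedulers as in the sketch below (illustrative only: the thread counts, data source name and connection details are placeholders, not our exact configuration):
Code:
import java.util.Properties;
import org.quartz.Scheduler;
import org.quartz.impl.StdSchedulerFactory;

public class SchedulerSetupSketch {
    public static void main(String[] args) throws Exception {
        // Clustered, database-backed scheduler (one instance per node).
        Properties persistent = new Properties();
        persistent.setProperty("org.quartz.scheduler.instanceName", "PersistentScheduler");
        persistent.setProperty("org.quartz.scheduler.instanceId", "AUTO");
        persistent.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        persistent.setProperty("org.quartz.threadPool.threadCount", "5");
        persistent.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        persistent.setProperty("org.quartz.jobStore.driverDelegateClass", "org.quartz.impl.jdbcjobstore.StdJDBCDelegate");
        persistent.setProperty("org.quartz.jobStore.isClustered", "true");
        persistent.setProperty("org.quartz.jobStore.dataSource", "myDS");
        // Placeholder connection details:
        persistent.setProperty("org.quartz.dataSource.myDS.driver", "oracle.jdbc.OracleDriver");
        persistent.setProperty("org.quartz.dataSource.myDS.URL", "jdbc:oracle:thin:@dbhost:1521:ORCL");
        persistent.setProperty("org.quartz.dataSource.myDS.user", "quartz");
        persistent.setProperty("org.quartz.dataSource.myDS.password", "quartz");
        Scheduler persistentScheduler = new StdSchedulerFactory(persistent).getScheduler();

        // In-memory scheduler: no database, so no TRIGGER_ACCESS semaphore.
        Properties memory = new Properties();
        memory.setProperty("org.quartz.scheduler.instanceName", "MemoryScheduler");
        memory.setProperty("org.quartz.threadPool.class", "org.quartz.simpl.SimpleThreadPool");
        memory.setProperty("org.quartz.threadPool.threadCount", "5");
        memory.setProperty("org.quartz.jobStore.class", "org.quartz.simpl.RAMJobStore");
        Scheduler memoryScheduler = new StdSchedulerFactory(memory).getScheduler();

        persistentScheduler.start();
        memoryScheduler.start();
    }
}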
During performance testing of our system the following situation was observed.
At a certain point the system came under heavy stress; garbage collection overhead was high (80%).
In the database we observed that the Quartz SQL statement
"SELECT * FROM QRTZ_LOCKS WHERE LOCK_NAME = 'TRIGGER_ACCESS' FOR UPDATE"
was by far the top database resource consumer.
Analyzing the log files showed the following (excerpts filtered on Quartz logging):
Code:
2010-01-26 17:00:32,325 [HTTPThreadGroup-69] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-69
...
2010-01-26 17:00:58,995 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-66
2010-01-26 17:00:59,070 [HTTPThreadGroup-53] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-53
2010-01-26 17:00:59,070 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: HTTPThreadGroup-66
2010-01-26 17:00:59,119 [HTTPThreadGroup-53] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: HTTPThreadGroup-53
2010-01-26 17:01:09,006 [HTTPThreadGroup-66] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-66
2010-01-26 17:01:09,135 [HTTPThreadGroup-55] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' is being obtained: HTTPThreadGroup-55
2010-01-26 17:01:15,746 [HTTPThreadGroup-69] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: HTTPThreadGroup-69
HTTPThreadGroup-69 is the first thread to get the lock after it is released by HTTPThreadGroup-53.
About 43 seconds pass from the moment HTTPThreadGroup-69 starts obtaining the lock (17:00:32) until the lock is given to it (17:01:15).
Moreover, about 16 seconds pass between HTTPThreadGroup-53 returning the lock (17:00:59) and HTTPThreadGroup-69 actually receiving it, apparently because HTTPThreadGroup-69 does not get the CPU time to take it.
Remember that the system is under stress. It looks as if Quartz does not get enough time to fire its triggers and starts running behind.
Other logging shows that the MisfireHandler is holding the lock for long periods:
Code:
2010-01-26 17:12:22,083 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:12:42,591 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:13:09,425 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:13:26,502 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:06,360 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:19,681 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:14:52,299 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:15:09,132 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:15:48,515 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' given to: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
2010-01-26 17:16:08,786 [QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler] DEBUG org.quartz.impl.jdbcjobstore.StdRowLockSemaphore -Lock 'TRIGGER_ACCESS' returned by: QuartzScheduler_PersistentScheduler-supizas1.pcs.portinfolink.com1264516910749_MisfireHandler
By holding the lock for such long periods (13-20 seconds at a time in the excerpt above) the MisfireHandler blocks other Quartz threads from firing triggers, which in turn produces more misfires for it to handle.
It looks like a vicious circle: when the system is under stress Quartz suffers, and in suffering it aggravates the situation.
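For completeness: the only tuning knobs we have found so far are the misfire threshold (how late a trigger must be before it counts as misfired, which as far as we can tell also influences how often the MisfireHandler scans) and, in recent Quartz versions, a cap on the misfire batch size. Whether these actually break the circle is part of our question; the property names below are from the JDBC job store configuration reference, the values are arbitrary:
Code:
import java.util.Properties;

public class MisfireTuningSketch {
    static Properties misfireProperties() {
        Properties p = new Properties();
        // Only consider a trigger misfired once it is this many ms late
        // (default 60000). Raising it should make the MisfireHandler run,
        // and grab TRIGGER_ACCESS, less often.
        p.setProperty("org.quartz.jobStore.misfireThreshold", "120000");
        // Recent Quartz versions can cap how many misfired triggers are
        // processed per acquisition of the lock (default 20), so the lock
        // is held for shorter stretches.
        p.setProperty("org.quartz.jobStore.maxMisfiresToHandleAtATime", "5");
        return p;
    }
}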
Questions:
- Is this a fair analysis, and is there anything we can do about it?
- When and how will Quartz get rid of these performance-unfriendly database semaphores? Can we avoid the locks in certain environments? Even moving the semaphore to a Terracotta grid only seems to shift the problem rather than remove it.
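For example, would something like the sketch below be legitimate for environments where only a single node touches the tables? (Untested on our side; org.quartz.jobStore.lockHandler.class is a documented property, and SimpleSemaphore is the JVM-local lock handler that ships in the same package as StdRowLockSemaphore. In a clustered setup this would obviously be unsafe.)
Code:
import java.util.Properties;

public class LocalLockHandlerSketch {
    static Properties nonClusteredProperties() {
        Properties p = new Properties();
        p.setProperty("org.quartz.jobStore.class", "org.quartz.impl.jdbcjobstore.JobStoreTX");
        // Replace the SELECT ... FOR UPDATE semaphore with a JVM-local one;
        // only conceivable when clustering is off and one node owns the tables.
        p.setProperty("org.quartz.jobStore.isClustered", "false");
        p.setProperty("org.quartz.jobStore.lockHandler.class", "org.quartz.impl.jdbcjobstore.SimpleSemaphore");
        return p;
    }
}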