Author |
Message |
11/18/2010 01:27:04
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
Hi,
We're using Ehcache 2.3.0 standalone. We have strange problems under heavy concurrent access with BlockingCache: some threads never wake up and remain in the WAITING state, which eventually brings down our system.
thread dump:
Thread: ajp-0.0.0.0-8010-2 : priority:5, demon:true, threadId:129, threadState:WAITING, lockName:java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@6bad4311
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
net.sf.ehcache.concurrent.ReadWriteLockSync.lock(ReadWriteLockSync.java:53)
net.sf.ehcache.constructs.blocking.BlockingCache.put(BlockingCache.java:204)
de.company.webdb.caching.CacheServiceBean.put(CacheServiceBean.java:166)
We have over 200 threads in this same state!
Any ideas?
Ehcache 1.6.2 works under the same scenario with no problems!
Java version:
Java HotSpot(TM) 64-Bit Server VM (build 16.0-b13, mixed mode)
|
|
|
11/18/2010 02:15:13
|
alexsnaps
consul
Joined: 06/19/2009 09:06:00
Messages: 484
Offline
|
We have noticed similar behavior under certain circumstances.
But are the 200 threads waiting for the write lock?
We are currently evaluating the best way to address this, so your input is more than welcome.
Thanks!
|
Alex Snaps (Terracotta engineer) |
|
|
11/18/2010 02:58:57
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
Yes, all waiting for the write lock.
What more information do you need?
What do you suggest as a workaround? Falling back to 1.6.2?
We are planning to use JGroups replication in the future. Is it possible to use that feature with 1.6.2?
|
|
|
11/18/2010 08:18:40
|
etsai
master
Joined: 07/31/2007 10:14:38
Messages: 72
Offline
|
This may be a JVM issue. Please use JDK 1.6.0_21 or higher.
See the following links:
https://jira.terracotta.org/jira/browse/DEV-4685
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6822370
|
|
|
11/18/2010 14:55:13
|
alexsnaps
consul
Joined: 06/19/2009 09:06:00
Messages: 484
Offline
|
If that works out for you, please let us know.
Thanks!
|
Alex Snaps (Terracotta engineer) |
|
|
11/19/2010 01:42:08
|
mmatook
neo
Joined: 03/31/2009 15:53:01
Messages: 3
Offline
|
I have seen this problem occur on quad-core, quad-socket systems under high load (1000 concurrent threads get stuck). It appears to be linked to a JVM bug (should be fixed in JDK 1.6.0_18 or higher).
A temporary workaround could be the -XX:+UseMembar JVM parameter; it seemed to help in some cases (if upgrading the JDK is not an option).
In any case, let us know how it goes.
|
|
|
11/19/2010 05:03:10
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
Thanks for the help.
We will first try the newest JDK (1.6.0_22) and, if needed, the -XX:+UseMembar VM flag.
I'll report the results.
|
|
|
11/19/2010 11:25:52
|
abellas
neo
Joined: 11/19/2010 11:18:10
Messages: 4
Location: Orlando, FL
Offline
|
I, too, am having the exact same issue:
"jrpp-733" prio=5 tid=1194 WAITING
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(Unknown Source)
at com.tc.object.locks.LockStateNode$PendingLockHold.park(LockStateNode.java:172)
at com.tc.object.locks.ClientLockImpl.acquireQueued(ClientLockImpl.java:731)
at com.tc.object.locks.ClientLockImpl.acquireQueued(ClientLockImpl.java:710)
at com.tc.object.locks.ClientLockImpl.lock(ClientLockImpl.java:50)
at com.tc.object.locks.ClientLockManagerImpl.lock(ClientLockManagerImpl.java:97)
at com.tc.object.bytecode.ManagerImpl.lock(ManagerImpl.java:728)
at com.tc.object.bytecode.ManagerUtil.beginLock(ManagerUtil.java:208)
at org.terracotta.collections.BasicLockStrategy.beginLock(BasicLockStrategy.java:12)
at org.terracotta.collections.ConcurrentDistributedMapDso.beginLock(ConcurrentDistributedMapDso.java:964)
at org.terracotta.collections.ConcurrentDistributedMapDso.get(ConcurrentDistributedMapDso.java:181)
at org.terracotta.collections.ConcurrentDistributedMapDsoArray.get(ConcurrentDistributedMapDsoArray.java:154)
at org.terracotta.collections.ConcurrentDistributedMap.get(ConcurrentDistributedMap.java:165)
at org.terracotta.cache.impl.DistributedCacheImpl.getNonExpiredEntry(DistributedCacheImpl.java:175)
at org.terracotta.cache.impl.DistributedCacheImpl.getNonExpiredEntryCoherent(DistributedCacheImpl.java:115)
at org.terracotta.cache.impl.DistributedCacheImpl.getTimestampedValue(DistributedCacheImpl.java:153)
at org.terracotta.modules.ehcache.store.ClusteredStore.get(ClusteredStore.java:210)
at net.sf.ehcache.Cache.searchInMemoryStoreWithStats(Cache.java:1695)
at net.sf.ehcache.Cache.get(Cache.java:1335)
at net.sf.ehcache.Cache.get(Cache.java:1306)
at coldfusion.tagext.io.cache.ehcache.GenericEhcache.get(GenericEhcache.java:75)
at coldfusion.tagext.io.cache.CacheTagHelper.getFromCache(CacheTagHelper.java:237)
at coldfusion.runtime.CFPage.CacheGet(CFPage.java:8183)
at cfCacheManager2ecfc1027664017$funcASSOCIATECACHEKEYEVICTIONSTORES.runFunction(C:\-------\service\utility\CacheManager.cfc:68)
We are definitely using the latest JDK - that was one of the items on our checklist. We upgraded to 1.6.0_22 on all clients and Terracotta servers. I will also try this JVM flag and report back... I'm thrilled to have found a forum thread discussing my exact issue (seemingly, so far).
|
|
|
11/23/2010 10:34:40
|
abellas
neo
Joined: 11/19/2010 11:18:10
Messages: 4
Location: Orlando, FL
Offline
|
The param didn't help things, we still have a couple dozen hung threads matching my previous post. We added the parameter to the clients though... I thought that made sense, but we're going to try it with the server, too.
Does anyone have any tips on how to more closely inspect what it is that's hanging up those threads? What confuses me is that the Terracotta server isn't overly stressed out on CPU, network, or memory when this is happening. I just have a hard time accepting the idea that the ColdFusion client is unable to contact or get a response back from Terracotta - if that's how I should be interpreting these hung threads.
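One generic way to inspect hung threads more closely (besides taking jstack dumps): query ThreadMXBean from inside the JVM to list which threads are parked on a lock and, where the JVM can tell, which thread owns that lock. This is a self-contained sketch, not specific to the ColdFusion/Terracotta setup above; the demo thread names are placeholders.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Programmatic alternative to jstack: report threads stuck in WAITING on
// a lock, together with the lock's current owner (if known to the JVM).
public class StuckThreadReport {
    public static String report() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        // dumpAllThreads(true, true) also captures locked monitors and
        // ownable synchronizers such as ReentrantReadWriteLock.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            if (ti.getThreadState() == Thread.State.WAITING && ti.getLockName() != null) {
                sb.append(ti.getThreadName())
                  .append(" waiting on ").append(ti.getLockName())
                  .append(", owner: ").append(ti.getLockOwnerName())
                  .append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Demo: park one thread on a monitor, then report it.
        Object lock = new Object();
        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                try { lock.wait(); } catch (InterruptedException ignored) { }
            }
        }, "demo-waiter");
        waiter.start();
        Thread.sleep(300);                     // give the thread time to park
        System.out.print(report());            // lists "demo-waiter waiting on ..."
        synchronized (lock) { lock.notifyAll(); }
        waiter.join();
    }
}
```

Comparing a few of these reports taken minutes apart shows whether the same threads stay parked on the same lock (truly stuck) or are merely slow.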
|
|
|
11/23/2010 11:50:55
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
We tested JDK 1.6.0_22. Same issues.
Now we will try the -XX:+UseMembar VM flag.
I'll report the results.
|
|
|
11/29/2010 01:32:40
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
We have the same problems with -XX:+UseMembar.
The app ran for 5 days with no problems, then crashed in the same way:
many threads waiting to be woken up.
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:842)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1178)
java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:807)
net.sf.ehcache.concurrent.ReadWriteLockSync.lock(ReadWriteLockSync.java:53)
net.sf.ehcache.constructs.blocking.BlockingCache.put(BlockingCache.java:204)
de.pantarhei.webdb.caching.CacheServiceBean.put(CacheServiceBean.java:204)
See the attached thread dump.
Any ideas how to solve this problem?
targit
PS: I wonder why the JDK's ConcurrentHashMap uses the same ReentrantLock mechanism and doesn't block. Maybe a signal/notifyAll on the lock is being missed?
Attachment: threaddump.txt (654 KB), downloaded 533 time(s)
|
|
|
11/29/2010 08:28:46
|
abellas
neo
Joined: 11/19/2010 11:18:10
Messages: 4
Location: Orlando, FL
Offline
|
While we have not directly solved the problem, we have a workaround. The hung threads were occurring after large mark-sweep GCs in ColdFusion. Rather than focus on the hung threads, we focused on getting the GC cycles under control.
Moving to the concurrent collector (-XX:+UseConcMarkSweepGC) and altering the generation ratio to increase the size of the young generation helped us prevent the large mark-sweeps that were causing the hung threads.
So the mystery still stands, and we'll keep working to understand it better. In the meantime, our immediate problem is solved.
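For anyone trying the same approach, the flag set described above might look roughly like this. The heap size and ratio below are illustrative placeholders, not the poster's actual values:

```shell
# Hypothetical JVM options sketching the workaround above: CMS for the
# old generation, plus a larger young generation (NewRatio=2 makes the
# young gen one third of the heap) so fewer objects are promoted and
# full mark-sweep collections become rarer.
java -Xms2g -Xmx2g \
     -XX:+UseConcMarkSweepGC \
     -XX:NewRatio=2 \
     -XX:+PrintGCDetails \
     -jar app.jar
```

Enabling -XX:+PrintGCDetails lets you confirm from the GC log whether the large mark-sweep pauses actually disappear after the change.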
|
|
|
11/29/2010 11:06:23
|
targit
journeyman
Joined: 11/18/2010 01:17:53
Messages: 10
Offline
|
We are already using -XX:+UseConcMarkSweepGC, but it doesn't help.
We are planning to fall back to 1.6.2 :(
|
|
|
11/29/2010 14:09:24
|
steve
ophanim
Joined: 05/24/2006 14:22:53
Messages: 619
Offline
|
Does anyone have a reproducible case that we can take a look at? We would love to help track it down.
|
Want to post to this forum? Join the Terracotta Community |
|
|
11/30/2010 00:25:55
|
alexsnaps
consul
Joined: 06/19/2009 09:06:00
Messages: 484
Offline
|
Sorry to ask the obvious, but are you sure no code path does a get() on the BlockingCache and then fails to do a put() on a cache miss?
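The pitfall Alex is describing comes from BlockingCache's contract: a get() that misses acquires a per-key write lock that is only released by a matching put() from the same thread, so a caller that bails out after a miss leaves every later reader parked forever. The sketch below emulates that contract with a plain ReentrantLock per key (MiniBlockingCache is a stand-in for illustration, not ehcache's actual implementation) and shows the safe calling pattern:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Stand-in for ehcache's BlockingCache contract: get() on a miss keeps a
// per-key lock held until the caller put()s a value for that key.
class MiniBlockingCache {
    private final ConcurrentHashMap<String, String> values = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    private ReentrantLock lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new ReentrantLock());
    }

    String get(String key) {
        ReentrantLock lock = lockFor(key);
        lock.lock();
        String v = values.get(key);
        if (v != null) {
            lock.unlock();       // hit: release immediately
        }
        return v;                // miss: lock stays held until put()
    }

    void put(String key, String value) {
        values.put(key, value);
        ReentrantLock lock = lockFor(key);
        if (lock.isHeldByCurrentThread()) {
            lock.unlock();       // wakes up threads parked in get()
        }
    }
}

public class BlockingCachePattern {
    public static void main(String[] args) {
        MiniBlockingCache cache = new MiniBlockingCache();
        String v = cache.get("key");
        try {
            if (v == null) {
                v = "loaded";            // expensive load goes here
                cache.put("key", v);     // MUST happen on every miss path,
            }                            // or other threads park forever
        } catch (RuntimeException e) {
            cache.put("key", "");        // release the lock even on failure
            throw e;
        }
        System.out.println(v);
    }
}
```

If any code path (an exception in the loader, an early return) skips the put(), every subsequent get() for that key queues up on the write lock, which matches the growing pile of WAITING threads in the dumps above.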
|
Alex Snaps (Terracotta engineer) |
|
|
|