Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)

The JIRA issue associated with the bug that is most likely causing this can be tracked at https://jira.terracotta.org/jira/browse/CDV-1277.

This has now been fixed, the change should be included in the 3.1.0 release (assuming it causes no further problems).

Hi

I suspect this might be a bug in our instrumentation of the ConcurrentHashMap key set. Are you concurrently modifying the map (e.g. adding mappings) at the same time as calling the toArray() method?

Thanks,

Chris - Terracotta Engineer

P.S.

If my hunch about this is correct then you should find that passing an excessively sized String[] (greater than the map capacity) should mitigate this problem (although depending on the map mutation rate it might not completely eliminate it).

I have filed a JIRA issue for this as a bug. If you wish to, you can follow it here: https://jira.terracotta.org/jira/browse/CDV-1262

I would be quite interested to see the code for your Thread subclass that was triggering this behavior. You weren't by any chance overriding the getId method were you?

Hi Troy,

Looking at your stack-traces didn't reveal too much, however one of the engineers here at Terracotta has managed to reproduce the dead lock himself using AspectJ loadtime weaving in a DSO client. Inspection of his stack-traces leads us to believe this is an AspectJ bug (albeit one they probably haven't seen before).

Stack traces for the deadlocked threads are:
Code:

 "Job_Executor2":
         at org.aspectj.weaver.loadtime.Aj$WeaverContainer.getWeaver(Aj.java:245)
         - locked <0xafb85598> (a java.util.Collections$SynchronizedMap)
         at org.aspectj.weaver.loadtime.Aj.preProcess(Aj.java:89)
         - locked <0x8592f6b0> (a java.net.URLClassLoader)
         at org.aspectj.weaver.loadtime.ClassPreProcessorAgentAdapter.transform(ClassPreProcessorAgentAdapter.java:52)
         at sun.instrument.TransformerManager.transform(TransformerManager.java:169)
         at sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:365)
         at java.lang.ClassLoader.defineClass1(Native Method)
         at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
         at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
         at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
         at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
         - locked <0x8592f6b0> (a java.net.URLClassLoader)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
         at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
         - locked <0x8592f6b0> (a java.net.URLClassLoader)
         at javax.management.remote.generic.ServerIntermediary$RequestHandler.handleNotifReqMessage(ServerIntermediary.java:700)
         at javax.management.remote.generic.ServerIntermediary$RequestHandler.execute(ServerIntermediary.java:626)
         at com.sun.jmx.remote.generic.ServerSynchroMessageConnectionImpl$RemoteJob.run(ServerSynchroMessageConnectionImpl.java:266)
         at com.sun.jmx.remote.opt.util.ThreadService$ThreadServiceJob.run(ThreadService.java:208)
         at com.sun.jmx.remote.opt.util.JobExecutor.run(JobExecutor.java:59)
 
 "main":
         at org.aspectj.weaver.loadtime.Aj$WeaverContainer.getWeaver(Aj.java:235)
         - waiting to lock <0xafb85598> (a java.util.Collections$SynchronizedMap)
         at org.aspectj.weaver.loadtime.Aj.preProcess(Aj.java:89)
         - locked <0x8592c130> (a sun.misc.Launcher$AppClassLoader)
         at org.aspectj.weaver.loadtime.ClassPreProcessorAgentAdapter.transform(ClassPreProcessorAgentAdapter.java:52)
         at sun.instrument.TransformerManager.transform(TransformerManager.java:169)
         at sun.instrument.InstrumentationImpl.transform(InstrumentationImpl.java:365)
         at java.lang.ClassLoader.defineClass1(Native Method)
         at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
         at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
         at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
         at java.net.URLClassLoader.access$000(URLClassLoader.java:56)
         at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
         at java.security.AccessController.doPrivileged(Native Method)
         at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
         at sun.misc.Launcher$AppClassLoader.findClass(Launcher.java)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
         - locked <0x8592c130> (a sun.misc.Launcher$AppClassLoader)
         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
         - locked <0x8592c130> (a sun.misc.Launcher$AppClassLoader)
         at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
         at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
         - locked <0x8592c130> (a sun.misc.Launcher$AppClassLoader)

In summary:

JMX Thread: LOCKED[JMX ClassLoader] => LOCKED[SynchronizedMap] => LOCKING[SystemClassLoader] (due to an implicit class load of an AspectJ class)
Main Thread: LOCKED[SystemClassLoader] => LOCKING[SynchronizedMap]

So the locking order reversal is caused by the AspectJ code trying to do a class load during the transform of a class loaded by a user class loader while concurrently trying to transform a class loaded by the system class loader. The race between the two threads explains why you sometimes get through with no problems.

I suspect the AspectJ team might not have seen this before because their normal startup is single threaded. By the time most applications become multi-threaded most if not all of the AspectJ classes will have been loaded thus avoiding the problem.

We suggest that you file an issue with the AspectJ project (https://bugs.eclipse.org/bugs/enter_bug.cgi?product=AspectJ), and we would certainly be happy to back you up with our analysis.

Let us know your thoughts,

Chris

Hi David,

Thinking about your problem, it might be useful if we can clarify some stuff about the way you are using the map.

You're right about the behavior of the lock ping-pong when ServerB scans the map. However ServerA will not immediately reacquire the write lock, it will only request it when it is needed. I guess the question I'm asking is how many mappings will be mutated (not mutations of the existing value object) in the ~5 minutes between successive backup operations?

This will be the number of greedy write locks that will need recalling and then acquiring by ServerB during a backup. The recall and acquire the other way will be spread over the next 5 minute period so they are probably less of an issue. If this is a "small" number (definition of small depending on your response time requirements) then you'll probably be alright. If not then switching out for a clustered CHM (or performing dirty reads (depending on how important backup integrity is)) might be better.

Chris

This appears to be the same bug that has also recently been exposed in an internal system test when using Weblogic. My research indicates that this is caused by the slightly bizarre (but not outside the spec) behavior of one of Weblogic's class loaders. Adding an exclude element in the instrumented classes section of your tc-config.xml for the class being parsed for annotations (weblogic.corba.cos.naming.NamingContextAny_WLStub) may be a workaround for your immediate problem. (Further exclusions may be necessary if it trips up on more classes).

How frequently do you get this behavior? My reading of the Sun bug you referenced and the LinkedBlockingQueue source code seems to indicate that the behavior reported to Sun should be impossible except under very unusual circumstances (e.g. the killing of threads which are mutating the queue).

If you are getting this behavior regularly in the absence of thread kills or L1 termination then I'd certainly be interested in seeing if we can track down the problem by reducing it to a more slimline test case.