[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Messages posted by: cdennis  XML
Profile for cdennis -> Messages posted by cdennis [83] Go to Page: 1, 2, 3, 4, 5, 6 Next 
I've looked over the thread dumps here and I see no evidence of any kind of deadlock occurring in the product. I do see that you are using explicit locking on the cache - one thing that could cause this issue would be unbalanced lock/unlock calls through the explicit locking API. It would likely be worth your while to double-check all of your code that interacts with the explicit locking API to confirm that there are no errors there.
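To illustrate the discipline I mean by "balanced" calls, here is a minimal standalone sketch using a plain `java.util.concurrent` lock (so it runs without the Ehcache jar - the same try/finally pairing applies to the cache's explicit locking methods):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BalancedLocking {
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Every acquire is paired with a release in a finally block, so an
    // exception inside the guarded section cannot leave the lock held.
    static String guardedUpdate(StringBuilder shared, String value) {
        lock.writeLock().lock();
        try {
            shared.setLength(0);
            shared.append(value);
            return shared.toString();
        } finally {
            lock.writeLock().unlock(); // always runs, keeping lock/unlock balanced
        }
    }

    static boolean isWriteLocked() {
        return lock.isWriteLocked();
    }

    public static void main(String[] args) {
        System.out.println(guardedUpdate(new StringBuilder(), "ok"));
    }
}
```

An acquire on a code path whose matching release can be skipped (an early return, or an exception thrown before the unlock) is exactly the kind of imbalance that leaves other threads blocked forever and looks like a product deadlock.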

Also, if you are a paying customer and have a support contract with us, it would definitely be best to pursue this through the regular support channels.

Chris
Now that the JSR-107 API is near completion, a number of developers at Software AG/Terracotta (the company behind Ehcache) are about to kick off work on a native JSR-107 implementation. (We already have a JSR-107 implementation that delegates to Ehcache, here: https://github.com/Terracotta-OSS/ehcache-jcache).

Obviously Ehcache has a feature set far larger than the scope of JSR-107 (something that is likely true of all cache implementations). The current Ehcache API, by far the most widely used Java caching API, originated in the days of Java 1.4. It's time the API was made current and incorporated JSR-107 constructs where applicable. We'd like to do this in as open a manner as possible so that we can get as much input from the community as possible.

More information can be found in Alex Snaps' blog post here: http://codespot.net/2014/02/26/starting-ehcache-3/
Discussion will happen here: https://groups.google.com/forum/#!forum/ehcache-dev
Code will appear here: https://github.com/ehcache

Looking forward to everyone’s contributions,

Chris Dennis

Software AG JSR-107 Expert Group Member
Terracotta/Ehcache Engineer
I believe I know what's going on here:

When you create the second cache manager, it will run recovery over the data files created by the first cache manager instance. This recovery process creates file mappings, and the unmapping of those mappings is handled by the garbage collector. Since Windows doesn't allow the deletion of files with a mapped region open, you won't be able to delete these files until after the GC runs and unmaps them. You don't quote the exception you get, but I'm guessing it complains about not being able to delete a file with a mapped region open.

You really have two ways of getting round this:

1. Call System.gc() a bunch of times and wait for the GC to unmap the file.
2. Use unique file names - you might also be able to use deleteOnExit(...) successfully here.
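Both workarounds can be seen in a small standalone sketch (this is illustrative only - the file name and the GC retry count are arbitrary choices, not Ehcache internals):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedFileCleanup {
    // Maps a region of the file, drops the reference, then retries deletion
    // while prodding the GC. On Windows the delete only succeeds once the GC
    // has unmapped the region; on *nix platforms it typically succeeds at once.
    static boolean mapAndDelete(File file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buffer.put(0, (byte) 42);
            buffer = null; // drop the only reference so the GC can unmap it
        }
        for (int i = 0; i < 10; i++) {
            if (file.delete()) {
                return true;
            }
            System.gc(); // workaround 1: encourage unmapping of the dead buffer
            Thread.sleep(100);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // Workaround 2: a unique name per run, plus deleteOnExit as a fallback.
        File f = File.createTempFile("cache-data-", ".data");
        f.deleteOnExit();
        System.out.println("deleted: " + mapAndDelete(f));
    }
}
```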

Hope this helps/explains,

Chris
This is obviously a JVM bug, but if you can reproduce it on a recent JVM (i.e. the most recent 1.6 or 1.7 release) then it's something we'd want to look into (to at least find a workaround). Given that you're currently using a three-year-old JVM, though, there really isn't much I can think of that would help you at the moment (other than using an earlier version of Ehcache).

Sorry I couldn't be more helpful!
No, it wasn't intentional, just not something we considered. Sorry!
The quick answers to your questions are:

1. You can't clear statistics anymore. This may mean you need to take baselines.
2. Statistics get enabled as and when you use them. If you stop looking at them they then disable themselves after a configurable delay (defaults to 5 minutes).

Overall the statistics have a much lower overhead than they had in the past, and you also only incur the overhead of the statistics that you use, instead of the overhead of all statistics.

We are aware that there isn't much documentation on the new statistics features, and it's something that our documentation team should be working on soon.

Hope that all makes sense!
I've looked in the 1.4.1 source code and the code in that release doesn't call System.gc(). I don't know of any reason why disabling it would cause any problems with Ehcache - however modifying the GC behavior in any environment is always going to change behavior in some way (noticeable or not). That is going to be especially true for a system that is using RMI. I think the basic answer here is that Ehcache shouldn't be adversely affected - but other aspects of your system might be. In short you're probably going to have to "suck it and see".
I suspect that the System.gc() calls are coming from the JVM's RMI implementation. You may find it informative to look at the following: http://docs.oracle.com/javase/6/docs/technotes/guides/rmi/sunrmiproperties.html

In particular the "sun.rmi.dgc.server.gcInterval" and "sun.rmi.dgc.client.gcInterval" properties.
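If the DGC turns out to be the culprit, raising those intervals is a one-liner. A sketch (the one-hour value is just an example; from memory the default on JVMs of that era was one minute, and the properties must be set before the RMI runtime initializes - most reliably via -D flags on the command line):

```java
public class RmiGcInterval {
    // Equivalent command-line form:
    //   java -Dsun.rmi.dgc.server.gcInterval=3600000 \
    //        -Dsun.rmi.dgc.client.gcInterval=3600000 ...
    static void raiseDgcIntervals(long millis) {
        System.setProperty("sun.rmi.dgc.server.gcInterval", Long.toString(millis));
        System.setProperty("sun.rmi.dgc.client.gcInterval", Long.toString(millis));
    }

    public static void main(String[] args) {
        raiseDgcIntervals(3_600_000L); // one hour between forced full GCs
        System.out.println(System.getProperty("sun.rmi.dgc.server.gcInterval"));
    }
}
```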
Just so I have some more information, what version of the product are you using?

One thing that immediately springs to mind is how you are validating that the offheap is not being deallocated. If you're doing this by looking at the resident size of the process then this could be misleading. The offheap deallocation is controlled by the JVM's garbage collector, so the actual release may not necessarily occur promptly. If you're seeing a genuine Java reference-based memory leak then that would be more concerning.
I believe I know the root cause of the ConcurrentModificationExceptions you are seeing. Now obviously I have a far from complete set of information, so some of this is a stab in the dark. I believe the logging line you alluded to in your original post is this one in DiskStorageFactory.java:
Code:
 LOG.error("Disk Write of " + placeholder.getKey() + " failed: ", e);
I believe that the CME is coming from your serialization code and is due to your code mutating a collection within one of your cache keys or values from another thread while Ehcache is trying to iterate over the collection during serialization. This is something that would need to be fixed on your end by ensuring that the objects you put in the cache are immutable - or at worst can handle being concurrently serialized and mutated.
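The usual fix is to cache a defensive, unmodifiable snapshot rather than the live collection. A minimal standalone sketch of the pattern (the names here are illustrative, not from your code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SafeCacheValue {
    // Takes a defensive, unmodifiable snapshot of the caller's collection.
    // Cache a snapshot like this and the serialization thread can iterate it
    // safely, no matter what the caller later does to its own live list.
    static List<String> snapshotForCache(List<String> live) {
        return Collections.unmodifiableList(new ArrayList<>(live));
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(List.of("a", "b"));
        List<String> cached = snapshotForCache(live);
        live.add("c"); // mutating the live list leaves the cached snapshot untouched
        System.out.println(cached);
    }
}
```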

This is as much speculation as I'm prepared to indulge in regarding your issues without receiving any more information. If you can provide us with more details, regarding your cache configurations and the stack traces you are seeing then we may be able to be more insightful. If you need to obfuscate or censor sensitive class, package or method names in the stack traces or configurations that is fine, as long as all the non-privileged information is still intact.

Thanks,

Chris
Hi Tomas,

You're quite correct - what's happening is there isn't enough space to store your key/value pair in the cache currently - and the attempt to allocate space has failed (since the offheap is exhausted). As you guessed this will trigger eviction of other entries in order to store the key/value pair. This logging is at a level that would be below most logging frameworks default thresholds - so you're probably running with more detailed logging switched on. Most people will never see this message, and I would probably recommend you raise your logging thresholds since it's not really very informative and is just going to cause your logs to be overly large.

Chris
The disk persistence settings do not work in that way. If you run with localTempSwap on, everything is put to disk initially. The offheap and heap tiers then act as read caches for that disk layer. This is why you don't see the write throughput you expect.
If you are trying to track down where all that RSS is coming from there are a couple of things you can try.

1. Switching on -XX:+PrintGCDetails will dump the heap details along with the stack dumps (e.g. using kill -3). This will give you the start and end addresses of the various heap regions.

2. If you're not averse to doing some reflection hackery there is a private static field in the java.nio.Bits class called reservedMemory which will tell you how much memory has been reserved for NIO direct buffers.

3. In principle if you get a heap dump of the JVM you could look at the address fields of the "top-level" DirectByteBuffer objects and find the addresses for those buffers to try and match up with the pmap output (I have never actually tried to do this).
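Option 2 can be sketched as below. Be warned that this worked on the 1.6/1.7 JVMs current at the time; the field is private, its type has varied between JDK versions, and module encapsulation on much later JDKs may block the `setAccessible` call entirely, so the sketch falls back to -1 when the field is unreachable:

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;

public class NioReservedMemory {
    // Returns the value of java.nio.Bits.reservedMemory, or -1 if the field
    // cannot be reached on this JVM.
    static long peekReservedMemory() {
        try {
            Class<?> bits = Class.forName("java.nio.Bits");
            Field f = bits.getDeclaredField("reservedMemory");
            f.setAccessible(true);
            Object value = f.get(null); // a long on old JDKs, an AtomicLong later
            return value instanceof Number ? ((Number) value).longValue()
                                           : Long.parseLong(value.toString());
        } catch (ReflectiveOperationException | RuntimeException e) {
            return -1L; // encapsulated or renamed on this JVM
        }
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocateDirect(1 << 20); // reserve 1 MiB
        System.out.println("reservedMemory: " + peekReservedMemory()
                           + " (holding " + b.capacity() + " bytes direct)");
    }
}
```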

Are you by chance trying to run the server on Java 1.7.0_04 on a Windows machine? There is a bug in this JVM version that prevents the TC server from starting: http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7172708
I believe (although I haven't checked this) that Eclipse uses the Process.destroy() call to terminate spawned processes. From what I remember on *nix platforms this maps to SIGKILL, so you're not going to get the shutdown hooks run in that situation.
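For reference, this is what registering such a hook looks like - and why a SIGKILL-style termination is a problem: the hook only ever runs on normal exit or a catchable signal like SIGTERM, never on SIGKILL:

```java
public class ShutdownHookDemo {
    // Registers a hook that runs on normal exit or SIGTERM -- but never on
    // SIGKILL, which is why a killed process can leave dirty cache files.
    static Thread registerCleanupHook(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "cache-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerCleanupHook(() -> System.out.println("flushing caches before exit"));
        System.out.println("main finished");
        // On normal return the hook prints; after kill -9 it never would.
    }
}
```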

In terms of snapshotting .index files, that wouldn't be enough: you'll need to snapshot the .data files too. What you could do in principle is the following (this is just a proposal, not a recommendation - I really don't think this is a good idea):

1. Suspend all of your cache operations (presumably using some kind of external locking or thread co-ordination).
2. Request a flush.
3. Copy the .index and .data files for the cache to a safe place.
4. Resume your cache operations.

On recovery you can then copy the backed up files over the dirty ones, and then you should be able to restore the cache state. You may have to be careful with file timestamps here though since the startup code will check them to try and assert whether the files are 'clean' or not.
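Step 3 of that procedure might look something like this - a sketch only, assuming the files are named `<cacheName>.index` / `<cacheName>.data` in the disk store directory (adjust to your actual layout), and copying attributes so the timestamps the startup code inspects are preserved:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class CacheSnapshot {
    // Copies a cache's .index and .data files into a backup directory,
    // preserving attributes (including the timestamps that the startup
    // code checks when deciding whether the files are 'clean').
    static void snapshot(Path diskStoreDir, String cacheName, Path backupDir)
            throws IOException {
        Files.createDirectories(backupDir);
        for (String suffix : new String[] {".index", ".data"}) {
            Path src = diskStoreDir.resolve(cacheName + suffix);
            Files.copy(src, backupDir.resolve(src.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING,
                       StandardCopyOption.COPY_ATTRIBUTES);
        }
    }
}
```

This must only run between steps 1 and 4, i.e. while all cache operations are suspended and after the flush has completed, otherwise the copied pair may be mutually inconsistent.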

Again I really don't think this is a good idea, your better solution is to arrange for a more graceful termination of your code.

Regards,

Chris
 
Powered by JForum 2.1.7 © JForum Team