[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
  [Search] Search   [Recent Topics] Recent Topics   [Members]  Member Listing   [Groups] Back to home page 
[Register] Register / 
[Login] Login 
[Expert]
What is the best way to sort a big list with Terracotta?  XML
Forum Index -> General
Author Message
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi all,
I'm having a big list (from 200 000 to a few milions elements). I've already implemened the Comparable interface for my elements.
My goal is to find the top 50 elements from that list. I've tried TreeSet and Collections.sort.

With Terracotta enabled, the running time for sorting is up to 6 times slower (even there is nothing clustered in my test).

Is this overhead normal? Can somebody explain me why? And if there is a better way to sort with Terracotta enabled, please tell me!
Thanks a lots.
Chu

PS: the sorting operation is run inside one JVM. I can't make it smaller for it's already a basic operation in my application.
colonel

journeyman

Joined: 07/23/2008 03:22:04
Messages: 12
Offline

Hi chu,

I've got no experience running Terracotta -- but I understand how it works and what it does from a theoretical points of view.

There might be 2 issues with how you try to accomplish your sorting and a few questions to ask:

Issue 1: It might just be that the instrumentation of the classes you're using (TreeSet or other collections) by TC is slowing things down. If that's the case I don't think there's not much you can do but switch to a collection implementation that has not been instrumented by TC. (Correct me if I'm wrong here and one can tell TC that one wants to use a non-instrumented class in certain places.)

Issue 2: Synchonization. A few questions arise: Do you need the sorting done on the "live" collection, i.e. the shared collection? Am I safe to assume that it *is* a shared collection?
If it *is* shared, then you might want to synchronize access to it so that only one thread does the work of sorting and finding the top 50 elements and propagating those into a different collection. Then this can be done by TC in a single update across the cluster which should be very efficient.
If it is not shared, then this can't really be the issue, I guess.

(So Issue #2 is the plain old "try to batch modifications to lower network access" advice.)

My thoughts on this, maybe that helps :)

- Denis
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi Denis,
Thanks for replying me! I see your points!

But it's not quite that! Because I run the test with a minimun tc-config and there is no dso config, which means there is no classe instrumented, no object shared, no lock, ... However, maybe the collection I use (TreeMap and ArrayList) is pre-instrumented in Terracotta, so you can be right about this point.

About the collection, ArrayList is not modified (added or removed) when it's being sorted. Except for the TreeMap, it's modified (re sorted) when being added new element. I think that's why TreeMap is not a good solution for my case.

My goal is to do this operation localy in one JVM without any cluster constrains (I need this run as fast as it can), then I get the result (the top 50 elements) to send back to the cluster (50 small elements don't take a lot of bandwiths so this is not the problem at the moment).

Chu
hhuynh

cherubim

Joined: 06/16/2006 11:54:06
Messages: 761
Offline

I would resort to writing my own data structure, one that TC doesn't instrument by default and sort my data with it. The fastest way to see which classes are preinstrumented by Terracota is to open up its bootjar and look at the content.
steve

ophanim

Joined: 05/24/2006 14:22:53
Messages: 619
Offline

I'll throw my two cents in here.

If sorting a large list you should copy the list sort it and then replace it. The reason is that most sort algorithms do multiple moves per entry causeing multiple transactional operations per entry. This can be expensive an unnecessary. At some point we can do an optimized version of this so you don't have to but this is what I recommend for now.

Hope this helps

Want to post to this forum? Join the Terracotta Community
gbevin

praetor

Joined: 07/04/2007 09:09:42
Messages: 210
Offline

Another reason might be that to be able to do the sorting, all the elements of the list need to be faulted in. If this is a shared list that is partitioning over many nodes, the latency of the network access to fault everything in could cause such a slowdown.

Want to post to this forum? Join the Terracotta Community
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi all,
Thanks for your helps.

hhuynh wrote:
I would resort to writing my own data structure, one that TC doesn't instrument by default and sort my data with it. The fastest way to see which classes are preinstrumented by Terracota is to open up its bootjar and look at the content. 

Today I've tried to make my ArrayList non instrumented in TC.
First I tried to declared it "exclude" in the instrumented class section of the tc-config. It doesn't work since the class is contained in the boot jar, so it's instrumented everytime I launch an TC application.
Then I copied the ArrayClass to make MyArrayClass which is used by my application and will not be declared as instrumented in TC. However, the ArrayClass uses some other classes (pre instrumented in the boot jar too), so I needed to do the same thing for these classes.
Finally, I have copied 3 classes : ArrayList, Collections, Arrays. The performance is much better. Before, TC overhead made the test 6 times slower. Now it's only 1,7 times slower. I think it's acceptable.

steve wrote:
I'll throw my two cents in here.

If sorting a large list you should copy the list sort it and then replace it.  

Do you mean I copy the main list to a local list, then I sort the local list and finally replace the main list by the local list?
If it's that, I tried it already but it didn't work since the main list is also local. If it's not what you want to say, please clarify this idea for me!

gbevin wrote:
Another reason might be that to be able to do the sorting, all the elements of the list need to be faulted in. If this is a shared list that is partitioning over many nodes, the latency of the network access to fault everything in could cause such a slowdown. 

It could be a reason, but I think it's not the problem here. In my test, the collection is not shared (as I said above, my tc-config doesn't have the dso section). So the collection only lives in the JVM client and can't be flushed to the TC server.

Thanks all again for your helps.
Chu
 
Forum Index -> General
Go to:   
Powered by JForum 2.1.7 © JForum Team