[Logo] Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Synchronization Performance Problem
Forum Index -> Terracotta for Spring
Author Message
sgilbert

journeyman

Joined: 01/05/2007 09:55:01
Messages: 10
Offline

I've created a test program that is similar to the producer-consumer example and I have found that the time needed to get an item from a Queue object is very long.

Specifically, the program I wrote uses an instance of java.util.concurrent.LinkedBlockingQueue, defined as a Spring bean in my application context file like this:

Code:
<bean id="queue" class="java.util.concurrent.LinkedBlockingQueue"/>


That bean is then also configured as shared in my TC config file.

My test client program puts 2000 Long objects into the queue (if it is empty) and then creates and runs 5 threads, each of which runs a loop that:

- calls queue.poll repeatedly until the queue is empty
- creates an object whose value is set from the Long taken off the queue. The object is then used to insert a row into a database table using Hibernate.
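For reference, a minimal single-VM sketch of a consumer test like the one described (class name and counter are my own illustration; the real test uses the Spring-defined queue bean and a Hibernate insert in place of process()):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class ConsumerSketch {
    // Shared queue; in the real test this is the Spring-defined bean.
    static final LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>();
    static final AtomicInteger processed = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        // Seed the queue as the test does (2000 Longs).
        for (long i = 0; i < 2000; i++) queue.put(i);

        Runnable worker = () -> {
            Long item;
            while ((item = queue.poll()) != null) {   // drain until empty
                process(item);                         // stand-in for the Hibernate insert
                processed.incrementAndGet();
            }
        };

        Thread[] threads = new Thread[5];
        for (int i = 0; i < 5; i++) (threads[i] = new Thread(worker)).start();
        for (Thread t : threads) t.join();
        System.out.println("processed " + processed.get() + " items");
    }

    static void process(Long value) { /* DB insert would go here */ }
}
```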

When running this program, I've seen the following performance:

Without Terracotta, using a single instance of the client program:
- queue.poll time is <1 ms and DB insert time is 11 ms

With Terracotta with a single instance of the client program:
- queue.poll time is 17 ms and DB insert time is 3 ms

With Terracotta with two instances of the client program:
- queue.poll time is 46 ms and DB insert time is 4 ms

The TC server and client programs are running on one machine and the database server is on another. While running the two client scenario, the TC server takes about 30% of the CPU and the two clients take 20-25% each.

Before I post anything else, I wanted to ask if there is something obviously wrong with what I am doing. Is it "normal" for the synchronization to take this long and require the amount of CPU time I am seeing? I can't imagine a queue.poll call taking 10x as long as a database insert operation.

I have been assuming for 2 days that I am doing something wrong but I've not been able to figure out what.
kbhasin

consul

Joined: 12/04/2006 13:08:21
Messages: 340
Offline

Hello sgilbert,

I assume you do not have an upper bound set on your queue, which might be causing some of the high latency you are seeing. Try setting an upper bound on the queue. For example, to get the equivalent of new LinkedBlockingQueue(50) via Spring bean initialization, do something like:

Code:
 
 <bean id="queue" class="java.util.concurrent.LinkedBlockingQueue">
     <constructor-arg index="0"><value>50</value></constructor-arg>
 </bean>
 
 


Secondly, the five concurrent threads are also affecting the latency as all five threads are contending on the same lock. Is it possible to partition the shared data? If yes, then you might want to consider having a separate queue for each reader. If no, then you might want to consider using a concurrent queue implementation like java.util.concurrent.ConcurrentLinkedQueue<E>.
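One way to sketch the queue-per-reader idea (the class name and round-robin writer below are my own illustration, not a Terracotta or Spring API): the writer distributes items across one bounded queue per reader, so readers never contend with each other for a lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one bounded queue per reader; the writer distributes
// items round-robin, and each reader polls only its own queue.
public class PartitionedQueues {
    private final List<LinkedBlockingQueue<Long>> queues = new ArrayList<>();
    private int next = 0;  // round-robin cursor (assumes a single writer)

    public PartitionedQueues(int readers, int capacity) {
        for (int i = 0; i < readers; i++) queues.add(new LinkedBlockingQueue<>(capacity));
    }

    public void put(Long item) throws InterruptedException {
        queues.get(next).put(item);
        next = (next + 1) % queues.size();
    }

    // Each reader polls only its own queue, so readers never share a lock.
    public Long poll(int readerId) {
        return queues.get(readerId).poll();
    }

    public static void main(String[] args) throws InterruptedException {
        PartitionedQueues pq = new PartitionedQueues(5, 50);
        for (long i = 0; i < 10; i++) pq.put(i);
        // Queue 0 received items 0 and 5; queue 1 received 1 and 6.
        System.out.println(pq.poll(0) + " " + pq.poll(1) + " " + pq.poll(0));
    }
}
```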

Lastly, are you using a read lock for readers and a write lock for writers? If your Terracotta config.xml file has something like this in the locks section:

Code:
 <locks>
   <autolock>
     <lock-level>write</lock-level>
     <method-expression>* *..*.*(..)</method-expression>
   </autolock>
 </locks>


you might want to replace it with something like this:

Code:
 <locks>
   <autolock>
     <lock-level>write</lock-level>
     <method-expression>* java.util.concurrent.LinkedBlockingQueue.poll(..)</method-expression>
   </autolock>

   <autolock>
     <lock-level>read</lock-level>
     <method-expression>* java.util.concurrent.LinkedBlockingQueue.peek(..)</method-expression>
   </autolock>
 </locks>


I hope this helps. Let us know if this helps in reducing the latencies you are seeing and please feel free to contact me directly if you have any additional questions at kbhasin@terracottatech.com.

Regards,
Kunal Bhasin.

tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

sgilbert,

For the two node case, you are probably by now aware that you are running into a classic distributed-computing problem centered on lock contention. Lock contention in a single VM doesn't normally rear its ugly head, except in the most performance-sensitive apps, so most people never see it. The main reason for the java.util.concurrent classes is to provide significantly better performance in concurrent situations, which is precisely where lock contention becomes an issue.

So what's really happening? Let's look at your test...

VM1: while(1)
  poll() -->
    acquire_lock();
    read head from queue
    modify queue to delete head
    release_lock();
    return item;
  process_item();

VM2: while(1)
  poll() -->
    acquire_lock();
    read head from queue
    modify queue to delete head
    release_lock();
    return item;
  process_item();

In this case, for the most part, you are always contending for the single write lock that protects the queue during the poll operation. Terracotta attempts to be fair here, so it ping-pongs lock grants between the two VMs. Each acquire_lock() incurs network latency (from the wire, your TCP stack, etc.), and thus the latency you see with two VMs is higher than with a single VM.

So for that reason, as Kunal suggested, you will need to consider changing your strategy to help reduce per-item contention.

Kunal already suggested partitioning your data to take advantage of concurrency.

Another strategy that you might consider is batching. A batching strategy will help to decrease the perceived latency of each individual item. You can use the drainTo() method with a maxElements of say 4 or 5. This should dramatically increase your overall throughput. I would suspect that the bottleneck in this scenario will shift from the contended lock to the DB.
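A minimal single-VM sketch of the drainTo() batching strategy described above (the class name, item count, and batch size are just for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

public class BatchingConsumer {
    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<Long> queue = new LinkedBlockingQueue<>();
        for (long i = 0; i < 12; i++) queue.put(i);   // seed the queue

        List<Long> batch = new ArrayList<>();
        int batches = 0;
        // One lock acquisition moves up to 5 items instead of one,
        // amortizing the clustered-lock round trip across the batch.
        while (queue.drainTo(batch, 5) > 0) {
            for (Long item : batch) process(item);
            batch.clear();
            batches++;
        }
        // 12 items drained 5 at a time: batches of 5, 5, and 2.
        System.out.println("batches: " + batches);
    }

    static void process(Long value) { /* DB insert would go here */ }
}
```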

Finally, you can also implement a queue-per-reader strategy which will reduce the lock contention, for readers, to zero. I mention this one last because it will of course require a code change. I have previously explored, with some success, making a data structure that looks like a queue on the outside, but implements multiple internal queues for each reader and writer, on the inside. Because you are using Spring, this would be as simple as changing the implementing class for your bean. I hope to add this work, as well as other work I have done in the past, to the Terracotta Forge which is soon-to-be-released (you can drop in on the forge anytime at http://www.terracotta.org/confluence/display/orgsite/Open+Terracotta+Forge)

Now, all that said, there are a lot of optimizations in specific areas that we would like to implement. In particular, for a queue, the following improvements in Terracotta would help your test in the simple case:

http://jira.terracotta.org/jira/browse/CDV-70

http://jira.terracotta.org/jira/browse/CDV-71

I hope that helps,

Taylor
sgilbert

journeyman

Joined: 01/05/2007 09:55:01
Messages: 10
Offline

I appreciate the quick responses. Based on the information provided, it sounds like you are both saying that there isn't anything I did blatantly wrong with my Terracotta server or configuration.

I am also inferring that a 50-fold increase in the time spent waiting for a lock, when locking is extended from multiple threads in one VM to a cluster by Terracotta, is not unexpected.

I do not have an upper bound set on my Queue. I put 2000 Longs into the queue as part of the setup of the test. In our real use case, we planned to have many 10s or possibly 100s of thousands of items in the queue.

With respect to locking and my config file: I have not put anything in the config file. The locking that is happening is being chosen automatically by the server or is inherent to the Terracotta version of java.util.concurrent.LinkedBlockingQueue.

With respect to partitioning the data, changing to a queue-per-reader, or pulling multiple items off the queue at one time: one of the primary things that attracted me to Terracotta was the idea that code doesn't have to change. My test code works well in a multithreaded environment in a single VM, with the wait time to get an item from the queue being very small compared to the DB operation time. One of the promises of Terracotta is that if code works well multithreaded in a single VM, it should not need to change to run in multiple VMs. One particular use case for us is a single work queue serving many client threads in several VMs, hence my test program. In that use case, pulling multiple items off the queue (or any of the other suggestions) is possible but undesirable, because it makes the client code more complex, specifically in error recovery. So the answer to the question about these alternate approaches is "maybe, but I don't want to."

The alternative for this use case is a single JMS message queue. The real use case is not 2 VMs with 5 threads each but rather 4 VMs with 100 threads each. I've not tested how a JMS server would handle 400 clients pulling single items off the queue, but based on what I have done so far with my test program, I expect the wait time would be less. If a simple JMS-based solution takes even longer, then Terracotta would be preferred, and at that point we'd consider optimizations.

With respect to using java.util.concurrent.ConcurrentLinkedQueue: When I started this exercise, I tried to use that class but immediately found it was not in the boot jar and therefore not an option. I was very surprised to see how few of the classes in java.util.concurrent are in the boot jar and available for use.

With respect to specifying write and read locks in the config file for the methods of LinkedBlockingQueue: I had assumed that since this class is in the boot jar, those kinds of optimizations were already inherent in the Terracotta version of the class and that I did not have to specify them.

kbhasin

consul

Joined: 12/04/2006 13:08:21
Messages: 340
Offline

Hello sgilbert,

There is always going to be some overhead when you move from a non-clustered single-JVM environment to a clustered multiple-JVM environment. Based on our benchmarks and our customers, Terracotta has consistently been one of the market leaders in terms of performance, amongst other things. Compared to other clustering approaches (like JMS), Terracotta does not use serialization and sends only field-level deltas, for highly optimised performance.

Having said that, would it be possible for you to share the test with us along with the following environment information so we can try to reproduce the latency numbers you are seeing?

1. I am assuming you are using Terracotta Spring?

2. What version of Terracotta are you using?

3. What JVM are you using - Sun Hotspot, JRockit, IBM, Other?

4. What JDK/JRE are you using - 1.3x, 1.4x, 1.5x, 1.6x?

5. What operating system are you using?

6. How much memory is allocated to the Terracotta client and Server JVMs?

Regards,
Kunal Bhasin.

sgilbert

journeyman

Joined: 01/05/2007 09:55:01
Messages: 10
Offline

Kunal- thanks for the response.

Getting the test program material together and posting it will take a bit of time, but I can at least answer your questions right now.

1. I am assuming you are using Terracotta Spring? 

Yes.

2. What version of Terracotta are you using? 

2.2, downloaded from terracotta.org.

The server prints this at start up:

Code:
2007-01-15 17:49:33,792 INFO - Terracotta, version 2.2 as of 20061203-151234.


3. What JVM are you using - Sun Hotspot, JRockit, IBM, Other?

4. What JDK/JRE are you using - 1.3x, 1.4x, 1.5x, 1.6x?
 

Sun 1.5.0_10.

I believe the Terracotta server is running with whatever was shipped with it.

5. What operating system are you using? 

Windows XP SP2

6. How much memory is allocated to the Terracotta client and Server JVMs? 

The Terracotta server is being started with start-tc-server.bat in the spring/bin directory which generates this command line:

Code:
"C:\tools\Terracotta2.2\terracotta-2.2\common\lib\tc.jar"  
 -server -Xms256m -Xmx256m -Xss128k 
 "-Dtc.install-root=C:\tools\Terracotta2.2\terracotta-2.2" 
 com.tc.server.TCServerMain

I added newlines for readability.

Note that I have removed the two JMX-related arguments, hoping they were causing the long wait times, but that had no effect.

I had been running my test clients with the default JVM memory, but I just added -Xms and -Xmx values of 128m, ran them again, and got the same results.
tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

sgilbert wrote:

With respect to partitioning the data, changing to a queue-per-reader, pulling multiple items off the queue at one time: One of the primary things that attracted me to Terracotta was the idea that code didn't have to change. This test code that I have now works well in a multithreaded environment in a single VM, with the wait time to get an item from the queue being very small compared to the DB operation time. One of the promises of Terracotta is that if the code works well multithreaded in a single VM, it should not need to be changed to run in multiple VMs using Terracotta. One particular use case for us is a single work queue for many client threads in several VMs, hence my test program. In that use case, pulling multiple items off the queue or any of the other suggestions is possible but undesirable because it makes the client code more complex, specifically in doing error recovery.
 


I completely agree with everything you have said. We are definitely working very hard to make the simple things like your basic queue test perform as well as possible.

However, as Kunal already pointed out, it's not an apples-to-apples comparison if you compare a single-node case to a multiple-node case. You have to compare multiple-node to multiple-node; for example, it would be fair to compare against JMS, or against writing the items into a database (and polling the DB to retrieve them), to simulate what a comparable solution would offer.

Also I think there are some points to be made...

First of all the error-recovery issue is a real one, but pulling one item off the list versus multiple doesn't resolve it, it just reduces the chances of it happening. You can still lose the single item in the event of node-death.

Second, the batching strategy is not something that would break your application or its implementation in any way. If you implement the batching strategy, you'll not only improve the performance of your app when it is distributed, you'll also improve its performance if you take away Terracotta and run the app in the single-node case.

In many ways, I think it's useful to think of Terracotta like Java garbage collection. Oftentimes it's not necessary to consider that there is a garbage collector, but in certain instances it is necessary to avoid certain "Garbage Collector anti-patterns". One of those is creating and throwing away a large number of objects in a small period of time. Is this small price in application design worth the advantages of garbage collection? Absolutely.

We think the same holds true for Terracotta.
sgilbert

journeyman

Joined: 01/05/2007 09:55:01
Messages: 10
Offline

Taylor wrote:
However, as Kunal already pointed out, it's not an apples-to-apples comparison if you compare a single-node case to a multiple-node case. You have to compare multiple-node to multiple-node; for example, it would be fair to compare against JMS, or against writing the items into a database (and polling the DB to retrieve them), to simulate what a comparable solution would offer.

I agree that it is not a fair comparison. I didn't expect the sync wait time to be close to the wait time inside a single VM without Terracotta, but I was hoping for maybe 4-10 ms; instead I saw 50-60.

A colleague of mine has run tests with ActiveMQ (JMS Server) in a similar fashion to what I did in my test program and he reported times of 5 ms to get an item from a queue over TCP on the same machine. I am going to put that queue implementation into my test program to get an apples-to-apples comparison.

Taylor wrote:
Also I think there are some points to be made...

First of all the error-recovery issue is a real one, but pulling one item off the list versus multiple doesn't resolve it, it just reduces the chances of it happening. You can still lose the single item in the event of node-death. 

True.
Recovering from losing 1 item in a dead node would not be all that different from losing 5.

Taylor wrote:
Second, the batching strategy is not something that would break your application or its implementation in any way. If you implement the batching strategy, you'll not only improve the performance of your app when it is distributed, you'll also improve its performance if you take away Terracotta and run the app in the single-node case.

I guess I can agree with that; however, there may be other effects when multiple items are taken off the queue at once. In my test, I populate the queue and then measure how long it takes to remove all the items. In our real application, the queue will be filled with a large number of items from time to time, with the worker nodes waiting to pull those items off. If 250 items go onto the queue, there are 100 worker nodes, and each removes 5 at a time, only half of the workers will be busy. In my test, all that happens with a queue item is a simple insert into the database, but in the real use case more will be done with each item, and pulling 5 items from the queue at once while half the workers sit idle would be a bad thing. In that case, the time to get an item from the queue becomes less important, but not unimportant.

I think what is important is what portion of the time is spent getting the item from the queue and what portion is spent doing whatever processing is going to be performed on that item.

Also important is how the wait time for an item from a queue that is a DSO compares to the wait for an item from a JMS message queue: the apples-to-apples comparison, which I still have to complete.

Taylor wrote:
In many ways, I think it's useful to think of Terracotta like Java garbage collection. Oftentimes it's not necessary to consider that there is a garbage collector, but in certain instances it is necessary to avoid certain "Garbage Collector anti-patterns". One of those is creating and throwing away a large number of objects in a small period of time. Is this small price in application design worth the advantages of garbage collection? Absolutely.

We think the same holds true for Terracotta.

I get the analogy, and I am not opposed to making some design decisions that benefit whatever solution is used. The problem I have at the moment is that the wait times are so high as to be prohibitive. I am still hoping, however, that I am doing something wrong in my test program, and that correcting it will at least bring the times down to what I would see pulling items from a JMS queue.
 