hadolphs | 09/01/2011 17:17:47
Guys, I have just started looking at Terracotta and I am considering it for a batch processing application framework we are building. We talk to multiple databases (SQL and NoSQL) and need to handle huge amounts of data per hour. We have built several processing components that are all configured and managed using Spring IoC. So far we have been executing jobs in a single JVM for testing, but now we want to scale this out and set up a grid to partition the processing of the data. We have four different types of processors (think of them as four steps the data goes through), but each processor does not depend on the data of the others.
I would like to have a cluster of Terracotta instances that share queues of work items, processed by Workers that subscribe to a WorkManager for the cluster. Some of the processing also requires accessing data that we want to cache (right now we are using plain Ehcache), and at the same time we would like to make this framework configurable, using Quartz to manage scheduling.
My question now is how to get going... I have read everything on the website and in the wiki, and now I am confused about whether I should be using TC DSO or TC Express... I don't find much documentation on TC Express (at least not as much as there is for DSO, whose use now seems to be discouraged).
Any ideas or pointers in the right direction would be of great help.
Thank you!
-H
gyan10 | 09/07/2011 03:19:44
You should use TC Express for your use case. To cluster the instances, you need to use distributed Ehcache. You can find enough information on the Terracotta website:
http://terracotta.org/documentation/enterprise-ehcache/installation-guide
To use Quartz for scheduling, you likewise need to use distributed Quartz:
http://terracotta.org/documentation/quartz-scheduler/installation-guide
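From memory, the clustered setup is mostly configuration; something along these lines (the server URL/port and cache name here are placeholders — check the install guides above for the exact settings):

```xml
<!-- ehcache.xml: point Ehcache at the Terracotta server array
     and mark the cache as Terracotta-clustered -->
<ehcache>
  <terracottaConfig url="localhost:9510"/>
  <cache name="workItems"
         maxElementsInMemory="10000"
         eternal="false"
         timeToIdleSeconds="300">
    <terracotta/>
  </cache>
</ehcache>
```

```properties
# quartz.properties: use the Terracotta-backed job store
org.quartz.jobStore.class = org.terracotta.quartz.TerracottaJobStore
org.quartz.jobStore.tcConfigUrl = localhost:9510
```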
Hope this helps you get started.
Cheers!!!
Gyan Awasthi
Terracotta - A Software AG company
hadolphs | 09/07/2011 05:08:04
Thanks Gyan10.
I have been reviewing the documentation you referenced in your post.
Up to now, what I think is the right approach (using TC Express) is the following:
- Use Ehcache the same way we have been using it, to cache data that the jobs need to access as part of their processing, but clustered.
- Use Quartz to handle job scheduling and cluster the job stores.
Now, the part I am still not sure about is how to use the TC Toolkit (http://terracotta.org/documentation/terracotta-toolkit/toolkit-usage) with TC Express to implement the Master/Worker pattern.
The idea I had was to model the different queues of "units of work" (WorkItems) that need to be processed by the jobs (Workers), and to manage the scheduling with Quartz.
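In single-JVM terms, what I picture is roughly the sketch below (plain java.util.concurrent as a stand-in for whatever clustered queue the Toolkit provides; all names are made up):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// One queue of work-item ids drained by a pool of Worker threads. In the
// clustered version the LinkedBlockingQueue would be replaced by a
// Toolkit-backed queue shared across JVMs; the worker loop stays the same.
public class WorkQueueSketch {
    private static final int POISON = -1;  // shutdown marker, one per worker

    public static int drain(BlockingQueue<Integer> queue, int workers) {
        AtomicInteger processed = new AtomicInteger();
        Thread[] pool = new Thread[workers];
        for (int i = 0; i < workers; i++) {
            pool[i] = new Thread(() -> {
                try {
                    while (true) {
                        int id = queue.take();        // blocks until work arrives
                        if (id == POISON) return;     // shutdown signal
                        processed.incrementAndGet();  // "process" the item
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            pool[i].start();
        }
        try {
            for (Thread t : pool) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    public static int demo(int items, int workers) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < items; i++) queue.offer(i);
        for (int i = 0; i < workers; i++) queue.offer(POISON);
        return drain(queue, workers);  // each item is taken exactly once
    }

    public static void main(String[] args) {
        System.out.println(demo(100, 4)); // 100
    }
}
```

The key property I need is the take() semantics: each item is handed to exactly one worker, cluster-wide.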
I don't see much reference documentation about this for TC Express (but quite a bit for DSO). Is there any example you can point me to?
Thanks!
gyan10 | 09/07/2011 05:34:48
To use the Toolkit with TC Express, please visit the URLs below for samples of Toolkit queues:
http://www.terracotta.org/documentation/terracotta-toolkit/toolkit-usage
and
http://www.ashishpaliwal.com/blog/2010/08/getting-started-with-terracotta-toolkit-part-1/
Cheers !!
ari | 09/07/2011 08:03:12
Why Master/Worker pattern? I am curious.
Have you thought about QuartzWhere for that work pattern?
--Ari
hadolphs | 09/08/2011 13:20:22
I was under the impression that QuartzWhere is only part of Terracotta Enterprise?
The rationale behind the use of the M/W pattern is the following:
I have 5 queues of work. Each queue is a different processing type, and I want those queues to be distributed across the cluster (hence I will distribute the queues of WorkItems across the cluster). Each item is handled by a specific type of worker/processor that knows what to do with that data entity. One queue, for example, holds ids of files that need to be processed (and I need to make sure there is no possibility of more than one worker processing the same file). Another queue contains collections of items that need to be inserted into a database, and similarly I don't want workers grabbing the same item for processing. When a WorkItem is completed, I want to add its result to another clustered queue that injects the results into another type of database for later processing... and you get the idea...
I want to be able to distribute these queues across different workers and orchestrate different scenarios (using, for example, WorkListeners): adding and managing workers and WorkItem queues, as well as passing errors and other notifications between WorkManagers.
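To make the flow concrete, here is the shape of one stage in plain Java (BlockingQueues standing in for the clustered queues; all names are made up):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Two of the five queues: a worker drains the "to process" queue and
// feeds results into the "to insert" queue for the next worker type.
// Clustered, these would be Toolkit- or cache-backed queues instead.
public class PipelineSketch {
    public static int stageOne(BlockingQueue<String> toProcess,
                               BlockingQueue<String> toInsert) {
        int moved = 0;
        String fileId;
        // poll() hands each item to exactly one worker, so no two
        // workers can end up processing the same file id
        while ((fileId = toProcess.poll()) != null) {
            String result = "processed:" + fileId;  // the real processing step
            toInsert.offer(result);                 // queue for the DB-writer worker
            moved++;
        }
        return moved;
    }

    public static void main(String[] args) {
        BlockingQueue<String> toProcess = new LinkedBlockingQueue<>();
        BlockingQueue<String> toInsert = new LinkedBlockingQueue<>();
        toProcess.offer("file-1");
        toProcess.offer("file-2");
        System.out.println(stageOne(toProcess, toInsert)); // 2
        System.out.println(toInsert.peek()); // processed:file-1
    }
}
```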
ari | 09/08/2011 19:02:30
Ok. Makes sense. And yes, QuartzWhere is enterprise. Now that I think about it, I don't like that but I'll deal with that offline.
As for your use case, thinking of it as queues is most logical. But might I suggest a cache + listener approach? It will be just as easy to implement and will end up leveraging the products and features we tune the most.
If you have a cache of work / ids / what have you, and everyone listens for the cache put()s and races to remove(), don't you have an event-based architecture without, specifically, the master/worker pattern?
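In single-JVM miniature, the race looks like this (a ConcurrentHashMap and threads standing in for the clustered cache and the nodes; names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Everyone "hears" about work appearing in the cache and races to
// remove() it; whoever gets a non-null value back owns the item.
public class RaceToRemove {
    public static int winners(int nodes) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        cache.put("work-1", "payload");     // the put() everyone was notified of
        AtomicInteger owners = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] pool = new Thread[nodes];
        for (int i = 0; i < nodes; i++) {
            pool[i] = new Thread(() -> {
                try { start.await(); } catch (InterruptedException ignored) {}
                // remove() is atomic: exactly one "node" gets the payload back
                if (cache.remove("work-1") != null) owners.incrementAndGet();
            });
            pool[i].start();
        }
        start.countDown();                   // fire all nodes at once
        try {
            for (Thread t : pool) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return owners.get();                 // always 1, however many race
    }

    public static void main(String[] args) {
        System.out.println(winners(8)); // 1
    }
}
```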
Just a thought...
--Ari
hadolphs | 09/08/2011 20:29:15
Hi Ari,
That's an interesting approach, and more in line with the kind of answer I was hoping for when I made my initial post :) thanks!
Ok, following up on that thought...
Let's say I create various caches to hold the data that needs to be processed, the file ids, etc. (basically a replacement for the "WorkItem" queues, correct?). For some of those, specifically the ones that deal with data processing and later data "writing" to different databases, what precautions or considerations should I take regarding the size of the cached entity? See, in the original design a WorkItem was, in one of those cases, sort of a "work order" that included a collection of items to write to the database. This collection could be a few hundred items. Must these objects be serializable? And what performance issues could I face by using distributed Ehcache caches across the cluster? Or do you think I should be fine?
Additionally, if I leverage Ehcache caches (using the Terracotta object store), is there a way to make the system robust and resilient to node failures? Say, for example, a node removes a work item from the cache (queue) to process it but fails during execution; it would be really nice to have a way for another node to take over that work item (work order) and process it on its behalf.
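The kind of resilience I mean, sketched in plain Java (maps standing in for the clustered caches; all names are made up): a checked-out item keeps a lease deadline, and any node can requeue items whose owner apparently died.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Instead of deleting a work item on checkout, move it to an "in flight"
// map with a deadline; a sweeper on any node can requeue expired leases.
public class LeaseSketch {
    private final Map<String, String> pending = new ConcurrentHashMap<>();
    private final Map<String, Long> inFlightDeadline = new ConcurrentHashMap<>();
    private final Map<String, String> inFlightWork = new ConcurrentHashMap<>();

    public void submit(String key, String work) { pending.put(key, work); }

    public boolean isPending(String key) { return pending.containsKey(key); }

    // Atomically claim an item and record when its lease expires.
    public String checkout(String key, long now, long leaseMillis) {
        String work = pending.remove(key);          // atomic claim
        if (work != null) {
            inFlightWork.put(key, work);
            inFlightDeadline.put(key, now + leaseMillis);
        }
        return work;
    }

    // On success the worker clears its lease.
    public void complete(String key) {
        inFlightWork.remove(key);
        inFlightDeadline.remove(key);
    }

    // Any node can sweep expired leases back into the pending "queue".
    public int requeueExpired(long now) {
        int requeued = 0;
        for (Map.Entry<String, Long> e : inFlightDeadline.entrySet()) {
            if (e.getValue() <= now
                    && inFlightDeadline.remove(e.getKey(), e.getValue())) {
                String work = inFlightWork.remove(e.getKey());
                if (work != null) { pending.put(e.getKey(), work); requeued++; }
            }
        }
        return requeued;
    }

    public static void main(String[] args) {
        LeaseSketch s = new LeaseSketch();
        s.submit("job-1", "payload");
        s.checkout("job-1", 0L, 100L);
        System.out.println(s.requeueExpired(200L)); // 1: lease expired, requeued
    }
}
```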
On a side note, the whole master/worker idea originated in some older articles using TC DSO that I found:
http://www.infoq.com/articles/master-worker-terracotta
and
http://jonasboner.com/2007/01/29/how-to-build-a-pojo-based-data-grid-using-open-terracotta.html
and then finding the library already implemented in DSO (tim-masterworker) as referenced here:
http://forums.terracotta.org/forums/posts/list/1721.page
But I always prefer the simple approach; that's what triggered this thread and series of questions.
Thanks for your thoughts around this!
hadolphs | 09/14/2011 18:50:57
Ari,
I guess you haven't had a chance to reply to my last message, but in the meantime I was looking at your recommendation and I have a question:
How do you guarantee that two running instances of the same class won't get the same element from the cache at the same time?
Meaning, if the worker processing class is listening for put()s on the cache in more than one instance, and more than one instance get()s the same cache item, that's a problem (we need to read before we can remove(), since I don't see a remove method that returns the element). Is there a way to prevent that?
Is that achieved by doing the following?
- Get notified of the put().
- Try to get a lock on the key with tryReadLockOnKey().
- If that fails, we assume another instance took it.
- If it succeeds: get() -> process the item -> if successful, remove().
I am assuming we don't need to release the lock on that key; however, if an error is caught during processing, we catch it, report it, and release the lock so that another instance can try it.
There is the opportunity to add all sorts of metadata to maintain the state of the job in the cache (as a queue), as long as the lock is guaranteed across the cluster.
I am assuming that to do this successfully with a replicated Terracotta Ehcache, we need to have it guarantee "strong" consistency.
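In code form, the per-key protocol I am describing looks roughly like this (a ReentrantLock and plain maps standing in for cluster-wide key locks and the cache; all names are made up):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Try the key lock; if we lose, another instance took the item. If we
// win: get -> process -> remove, and on failure leave the item in place
// and release the lock so another instance can retry.
public class LockedTakeSketch {
    public interface Processor { void process(String work) throws Exception; }

    public static boolean tryProcess(Map<String, String> cache,
                                     Map<String, ReentrantLock> locks,
                                     String key, Processor processor) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        boolean acquired;
        try {
            acquired = lock.tryLock(100, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) return false;        // another instance has it
        try {
            String work = cache.get(key);
            if (work == null) return false; // already processed elsewhere
            processor.process(work);        // may throw
            cache.remove(key);              // only on success
            return true;
        } catch (Exception e) {
            // report and leave the item in place for another instance
            return false;
        } finally {
            lock.unlock();                  // always release the key lock
        }
    }

    public static void main(String[] args) {
        Map<String, String> cache = new ConcurrentHashMap<>();
        Map<String, ReentrantLock> locks = new ConcurrentHashMap<>();
        cache.put("id-1", "work");
        System.out.println(tryProcess(cache, locks, "id-1", w -> {})); // true
    }
}
```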
Let me know if I am on the right track.
Thanks!
ari | 09/14/2011 20:55:25
Why would remove() return an element? You told it which element to remove.
There's replace() if you are looking for that, but cache operations (get, put, remove, replace, putIfAbsent) are all atomic on the cache. Two nodes cannot remove the same element unless it was put back in between.
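Concretely, the contract I mean (shown here with a plain ConcurrentMap, which honors the same atomicity within one JVM):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sequential illustration of the atomicity contract: remove() succeeds
// for exactly one caller per put, and only a re-put makes it succeed again.
public class AtomicOps {
    public static boolean[] demo() {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        cache.put("id-1", "work");
        boolean first  = cache.remove("id-1") != null;  // wins: element was there
        boolean second = cache.remove("id-1") != null;  // loses: already removed
        cache.putIfAbsent("id-1", "work again");        // only inserts if absent
        boolean third  = cache.remove("id-1") != null;  // wins again after re-put
        return new boolean[] { first, second, third };
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // true false true
    }
}
```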
--Ari
hadolphs | 09/15/2011 07:36:55
Hi Ari,
Ok, I'll try to simplify to make the question clearer. What I meant was, instead of doing:

if (cache.tryReadLockOnKey(key, timeout)) {
    Element e = cache.get(key);
    // ... process element e here ...
    cache.remove(key);
    // Question: do we need to do something about the read lock we got,
    // even though we removed the object?
}

having a convenience method that does this in a "blocking" way:

Element e = cache.getAndRemoveWithLock(key);
// ... process the element here; if it fails and an exception is caught,
// put it back in the cache.
But anyway, is the approach I am describing what you had in mind? I am still missing the listener part (to get notified on puts, etc.); I need to figure that one out... if you have any pointers or examples, that would be great...