Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)

I hate to be so bold as to ask someone who might answer this to read such a long blog post, but is there a Terracotta solution that achieves something like what is described here:
http://natishalom.typepad.com/nati_shaloms_blog/2008/03/scaling-out-mys.html

In short, Nati describes an In Memory Data Grid (IMDG) that acts as an asynchronous tier between an application and a database. The actual system of record is the IMDG and transactions are run against that instead of the database.

Oracle Coherence seems to dominate this IMDG space right now, but the product is prohibitively expensive for startups or small businesses and dealing with Oracle is not fun even under the best conditions.

Yes. 2 ways to do this with TC:

1. The easy way: just use TC as a cache, with a Map as your interface. TC is already durable to disk, so you don't need some fancy IMDG to scale out your storage or to get pseudo-persistence via replication. It's fast and easy. You will need a Queue and a separate thread/JVM that asynchronously flushes the changes made to the map out to the DB. If you want to do this, let me know; I need to write down the implementation of this pattern anyway.

2. The identical way: the Terracotta WorkManager framework on our forge gives you the scaled-out in-memory model of an IMDG. You still need one of the workers to be a drain on the queue of changes, flushing them asynchronously to the DB.

--Ari
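
For readers following along, option 1 above can be sketched roughly as follows. This is a minimal illustration in plain Java, not Terracotta's implementation or the write-up Ari mentions. The class name WriteBehindCache, the Change holder, and the dbWriter callback are all hypothetical, and the sketch assumes the map and queue would be the objects shared through Terracotta, which is not shown here.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BiConsumer;

// Hypothetical sketch: none of these names come from Terracotta.
class WriteBehindCache<K, V> {

    // One queued change: a key and the value it was set to.
    static final class Change<K, V> {
        final K key;
        final V value;
        Change(K key, V value) { this.key = key; this.value = value; }
    }

    // The map is what the application reads and writes (the system of record).
    private final Map<K, V> store = new ConcurrentHashMap<>();
    // The queue records changes that have not yet reached the database.
    private final BlockingQueue<Change<K, V>> pending = new LinkedBlockingQueue<>();

    // dbWriter stands in for whatever actually persists a row (assumption).
    WriteBehindCache(BiConsumer<K, V> dbWriter) {
        Thread flusher = new Thread(() -> {
            try {
                while (true) {
                    Change<K, V> change = pending.take(); // blocks until a change arrives
                    dbWriter.accept(change.key, change.value);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop draining when interrupted
            }
        }, "write-behind-flusher");
        flusher.setDaemon(true);
        flusher.start();
    }

    // Synchronized so the queue order matches the order applied to the map.
    synchronized void put(K key, V value) {
        store.put(key, value);
        pending.add(new Change<>(key, value));
    }

    // Reads are served from memory and never touch the database.
    V get(K key) {
        return store.get(key);
    }
}
```

A caller would construct it with something like `new WriteBehindCache<String, String>(dao::save)`, where `dao::save` is a placeholder for real persistence code. Error handling, batching, and retry of failed database writes are deliberately left out.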

Thank you for your reply.

ari wrote:

1. The easy way: just use TC as a cache, with a Map as your interface. TC is already durable to disk, so you don't need some fancy IMDG to scale out your storage or to get pseudo-persistence via replication. It's fast and easy. You will need a Queue and a separate thread/JVM that asynchronously flushes the changes made to the map out to the DB.

This approach sounds interesting. Are there any articles or best practices I might look at that would help me to do a proof of concept for this kind of setup?

A separate, but ultimately related, question: Since Terracotta uses a centralized server, is it impossible to have an in-memory data set larger than the memory capacity of that one server? Other products might tackle this by partitioning the data across several servers, but I'm not sure how Terracotta would handle it.

Terracotta spills to disk. In theory, a single server can handle as much data as you can fit on your commodity hard disk.

The only catch is that not all parts of our server logic handle an infinite number of objects. We are doing 2 things over the next few releases:

1. Tuning the server logic (about which I am being vague because I don't want to make promises till we deliver the product).

2. Making our TC Server capable of striping (run more than 1 transparently under any use case).

For now, the work-around is what we call "Locality of Reference". (Well, not our term, but we use it as a work-around.) What this means is that if you use some sort of work-division or load-balancing algorithm in your application layer (where your JVMs run), you will naturally deflect load off of our TC Server. Then we can handle lots and lots and lots of data.
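
As one illustration of the kind of work division described above, the sketch below routes each key to a fixed JVM by hashing, so that a given object tends to be touched by only one node. The post does not specify any particular algorithm; the KeyRouter class and the node names are hypothetical, and simple modulo hashing is just one assumed choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a key-affinity router for the application layer.
class KeyRouter {
    private final List<String> nodes;

    KeyRouter(List<String> nodes) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodes = new ArrayList<>(nodes); // defensive copy
    }

    // The same key always maps to the same node, so that node's JVM keeps
    // the object local instead of repeatedly faulting it in from the server.
    String nodeFor(Object key) {
        int bucket = Math.floorMod(key.hashCode(), nodes.size()); // never negative
        return nodes.get(bucket);
    }
}
```

With plain modulo hashing, adding or removing a node remaps most keys; a consistent-hashing scheme would reduce that churn, at the cost of more code.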

As for the write-up on TC as a write-behind cache of the db, thus inverting the model and making the cache the SoR, I can get on that now. No one has written it till now.

I owe it to several folks by now. What are your time frames? Could you maybe PM me the use case so I can make sure I am not wasting your time? (Or post it here if you are able to.)

Cheers,

--Ari

ari wrote:

As for the write-up on TC as a write-behind cache of the db, thus inverting the model and making the cache the SoR, I can get on that now. No one has written it till now.

I owe it to several folks by now. What are your time frames? Could you maybe PM me the use case so I can make sure I am not wasting your time? (Or post it here if you are able to.)

There is no rush on my end. I am merely investigating the use of IMDGs in general. During my last few projects I have grown increasingly frustrated with the standard pattern of achieving scalability in mid-size, data-centric applications (small being a simple internal app and large being the huge, multi-million-user apps that require extremely customized infrastructure). These mid-size apps still need to be clustered, but they shouldn't require the specialized infrastructure of a large-scale app. There are plenty of solutions, like distributed caches for Hibernate, but I just can't help but feel that there is a better, faster way. I've experimented with in-memory object databases, Oracle Coherence, and GigaSpaces EDG. Each has its pros and cons. I guess Coherence has been the best fit so far, but it has a double whammy working against it: 1) being from Oracle, a difficult (to put it kindly) company to work with, and 2) being prohibitively expensive.

I'd like to find a solution, or at least repeatable pattern, that would help me on these kinds of mid-size, data-centric applications.

If I can add my $0.02, what Ari is saying is that TC fundamentally has the right pieces - the lego blocks if you will - that others do not to build a superior solution.

TC:
* Durable heap to disk, so you don't just have an "IMDG" - you have a clustering solution whose memory reads at memory speeds but is as durable as the DB
* An insertion layer into your code that is non-intrusive, which is *critical* when dealing with Domain Designs. Distributed caches (and spaces) force you to break your domain design and key all of your objects. Bad things happen after that.
* Automatic data locality and partitioning - Terracotta manages replicating in the most efficient way. Only deltas to your object graph are pushed, objects automatically reside in JVMs that need them, and not in JVMs that don't (vs. alternatives that always force you to configure and partition up front)
* Ability to pick your stack
* Ability to code any data construct you like

What Ari is saying is that if you need your data to be safe, but not in the DB, don't even put it in the DB - who needs all that complexity? Alternatives cannot say this, because they can only replicate in memory, and must use an external SoR for safety.

If you do need to put your data into the DB (e.g. it belongs there), then it's easy enough to write an implementation that manages async replication for you - and the implementation in Terracotta ends up as POJOs, not as a proprietary API (not that it really matters to someone who wants a turnkey solution, but for many, being able to crack the hood, so to speak, is a huge benefit).

That pattern is what Ari is promising to write up. In essence, it will turn out to be something like a listener, some queues, and so on.

I have no great love for the database so a solution that can be durable but not require the DB is certainly something I'd consider.

Matt,

I sent you a PM on this topic. Let's collaborate! It will be fun.

--Ari

The link here http://www.terracotta.org/confluence/display/wiki/TechnicalFAQ#TechnicalFAQ-WhataremychoiceswithregardstoTerracottaclustereddistributedcaches%3F
refers to a document named Terracotta_DistributedCache_CommonTopologies.pdf (which shows as an attachment on the Wiki)

That document highlights the approaches Ari talked about, such as cases where a cache need not be partitioned and cases where the cache could be partitioned across client JVMs and Terracotta servers.

Hope this helps as well.
thanks