Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)
Clustering Lucene: Problem in sharing an IndexWriter instance with Terracotta
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi all,
I'm working on a small project to cluster a Lucene application.

First I tried to run the example given at http://svn.terracotta.org/svn/forge/projects/labs/terracotta-lucene-examples/trunk/terracotta-lucene-examples
But it doesn't work: when I stop and restart the example application (org.terracotta.lucene.example.RAMDirectoryExample.java), all the data in the RAMDirectory instance is gone. I tried to debug it (I think the problem is in writing the index), but I didn't succeed.

Then I decided to write some simple test code. I succeeded in sharing the RAMDirectory instance with Terracotta, but I cannot write from multiple JVMs, since Lucene forbids using multiple IndexWriter instances to modify an index (I read this in the book "Lucene in Action", section 2.9.2 Thread-safety, page 60).
So I have to share one IndexWriter instance across all my JVMs. I tried to do that, but I didn't succeed: there are a lot of things to configure in Terracotta (instrumented classes, locks, ...).

Here is my code:
Code:
 /**
  * A simple example of an in-memory search using Lucene.
  */
 import java.io.IOException;
 import java.io.StringReader;
 
 import org.apache.lucene.search.Hits;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.search.Searcher;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.queryParser.ParseException;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 
 public class MyMain
 {
 	public static RAMDirectory idx = new RAMDirectory();
 	public static IndexWriter writer; 
 	
   public static void main(String[] args)
   {
    
    try
    {
      System.out.println("Start ...");

      writer = new IndexWriter(idx, new StandardAnalyzer(), true);

      writer.addDocument(createDocument("Doc 1", "today is a day"));
      writer.addDocument(createDocument("Doc 2", "today is a new day"));

      // Optimize and close only once, after all documents are added;
      // calling addDocument on a closed writer throws an exception.
      writer.optimize();
      writer.close();
 
       System.out.println("Done!");
 
       /*Searcher searcher = new IndexSearcher(idx);
       search(searcher, "today");
       searcher.close();*/
     }
     catch (IOException ioe)
     {
       ioe.printStackTrace();
     }
     catch (Exception e)
     {
       e.printStackTrace();
     }
   }
 
   /**
    * Make a Document object with an un-indexed title field and an indexed
    * content field.
    */
   public static Document createDocument(String title, String content)
   {
     Document doc = new Document();
 
     // Add the title as an unindexed field…
     doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
 
     // …and the content as an indexed field. Note that indexed
     // Text fields are constructed using a Reader. Lucene can read
     // and index very large chunks of text, without storing the
     // entire content verbatim in the index. In this example we
     // can just wrap the content string in a StringReader.
     doc.add(new Field("content", new StringReader(content)));
 
     return doc;
   }
 
   /**
    * Searches for the given string in the "content" field
    */
   public static void search(Searcher searcher, String queryString)
       throws ParseException, IOException
   {
 
     // Build a Query object
     QueryParser parser = new QueryParser("content", new StandardAnalyzer());
     Query query = parser.parse(queryString);
 
     // Search for the query
     Hits hits = searcher.search(query);
 
     // Examine the Hits object to see if there were any matches
     int hitCount = hits.length();
     if (hitCount == 0)
     {
       System.out.println("No matches were found for \"" + queryString + "\"");
     }
     else
     {
       System.out.println("Hits for \"" + queryString
           + "\" were found in quotes by:");
 
 
 
       // Iterate over the Documents in the Hits object
       for (int i = 0; i < hitCount; i++)
       {
         Document doc = hits.doc(i);
 
         // Print the value that we stored in the "title" field. Note
         // that this Field was not indexed, but (unlike the
         // "contents" field) was stored verbatim and can be
         // retrieved.
         System.out.println(" " + (i + 1) + ". " + doc.get("title"));
       }
     }
     System.out.println();
   }
 }
 
 


My tc-config.xml:
Code:
 <?xml version="1.0" encoding="UTF-8"?>
 <con:tc-config xmlns:con="http://www.terracotta.org/config">
   <servers>
     <server host="127.0.1.1" name="localhost">
       <dso-port>9510</dso-port>
       <jmx-port>9520</jmx-port>
       <data>terracotta/server-data</data>
       <logs>terracotta/server-logs</logs>
       <statistics>terracotta/cluster-statistics</statistics>
     </server>
     <update-check>
       <enabled>true</enabled>
     </update-check>
   </servers>
   <clients>
     <logs>terracotta/client-logs</logs>
     <statistics>terracotta/client-statistics/%D</statistics>
     <modules>
       <module name="clustered-lucene-2.0.0" version="2.6.1"/>
     </modules>
   </clients>
   <application>
     <dso>
       <instrumented-classes>
         <include>
           <class-expression>org.apache.lucene.store.RAMDirectory</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.index.IndexWriter</class-expression>
           <on-load>
             <execute>self.segmentInfos = new SegmentInfos();</execute>
           </on-load>
         </include>
         <include>
           <class-expression>MyMain</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.store.RAMDirectory$1</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.store.Lock</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.search.DefaultSimilarity</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.search.Similarity</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.analysis.standard.StandardAnalyzer</class-expression>
         </include>
         <include>
           <class-expression>org.apache.lucene.analysis.Analyzer</class-expression>
         </include>
       </instrumented-classes>
       <roots>
         <root>
           <field-name>MyMain.idx</field-name>
         </root>
         <root>
           <field-name>MyMain.writer</field-name>
         </root>
       </roots>
       <transient-fields>
         <field-name>org.apache.lucene.index.IndexWriter.segmentInfos</field-name>
       </transient-fields>
       <locks>
         <autolock>
           <method-expression>* *..*.*(..)</method-expression>
           <lock-level>write</lock-level>
         </autolock>
       </locks>
     </dso>
   </application>
 </con:tc-config>
 


I'm using Terracotta 2.6.1 and Lucene 2.0.0.

I really need your help to solve this problem: sharing an IndexWriter instance with Terracotta.

Thanks in advance.
Chu
ari

seraphim

Joined: 05/24/2006 14:23:21
Messages: 1665
Location: San Francisco, CA
Offline

Several things to get you kickstarted:

1. Either use TIMs (Terracotta Integration Modules) or explicitly cluster Lucene internals, but not both. Specifically, notice that you have a <modules> stanza with the Lucene 2.0.0 module and <instrumented-classes> entries that identify Lucene classes for clustering. The TIM does everything it needs on its own.

2. When you restarted everything and all your indexes were gone, did you restart Terracotta as well? Have you looked in your Admin Console and confirmed you have data there? I suspect you do. But I suspect the following configuration change is bad:

Code:
       <transient-fields>
          <field-name>org.apache.lucene.index.IndexWriter.segmentInfos</field-name>
        </transient-fields>
 


I am curious: where did you read about the Terracotta <transient-fields> setting and the <on-load> hook? Why did you decide to use them?

Also note that the configuration file you posted here does not enable permanent storage in the TC server. You will want to do that before you are done.
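A sketch of what that looks like (assuming the 2.6 config schema; permanent storage is enabled per server via a <dso><persistence> stanza):

Code:
 <server host="127.0.1.1" name="localhost">
   ...
   <dso>
     <persistence>
       <mode>permanent-store</mode>
     </persistence>
   </dso>
 </server>
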

3. Lucene's RAMDirectory is not well-suited for clustering. Use Compass's custom TerracottaDirectory with Lucene if you must (or just use Compass itself).
http://www.compass-project.org/

--Ari
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi ari,

ari wrote:
Either use TIMs (Terracotta Integration Modules) or explicitly cluster Lucene internals, but not both. Specifically, notice that you have a <modules> stanza with the Lucene 2.0.0 module and <instrumented-classes> entries that identify Lucene classes for clustering. The TIM does everything it needs on its own.

At the beginning I used only the Lucene TIM. I also declared a RAMDirectory as a shared root.
But when I restart my application (keeping the TC server running), it doesn't work: the RAMDirectory instance is still there, but queries return nothing, as if the RAMDirectory instance had been erased. It hasn't, though, since I checked its last modified time.

ari wrote:

I am curious: where did you read about the Terracotta <transient-fields> setting and the <on-load> hook? Why did you decide to use them?
 

All the strange things in my config came from trying to share an IndexWriter instance.
When I did that, a lot of exceptions were thrown: non-portable class, lock, and so on.
To "debug" these exceptions, I declared some classes as instrumented (so they would become portable) and added some locks.
But it turned out that a few classes are truly non-portable: even when I declared them as instrumented, they stayed non-portable. So I made their fields transient, and to initialize those fields I used an <on-load> script. I'm not sure about the scripts; I was only trying to make my application work.

ari wrote:

Also note that the configuration file you posted here does not enable permanent storage in the TC server. You will want to do that before you are done.
 

I think that as long as I don't stop the TC server during the tests, this option should not change the behavior of my application.
But I'll enable it to see if things get better.

ari wrote:

3. Lucene's RAMDirectory is not well-suited for clustering. Use Compass's custom TerracottaDirectory with Lucene if you must (or just use Compass itself).
http://www.compass-project.org/
 

I'll try this.

Thanks for replying.
CHU

ari

seraphim

Joined: 05/24/2006 14:23:21
Messages: 1665
Location: San Francisco, CA
Offline

Just to be clear, I do not know Lucene well. But I am curious how it will ever work if you make these SegmentInfos things transient and then construct them <on-load>.

BTW, I think you did a great job pushing through this stuff. Our tools can sometimes lead you astray. And you seem to really understand why you made changes, so I am definitely impressed. Very well done.

That said, the confusing thing to me is why you need a root at all. When you look at your TC Admin Console (bin/admin.sh), do you see a connected client JVM each time you start your app wherein Lucene resides? And do you see any data in the object browser in that same console? I guess you are hooked in to TC because you are getting TC exceptions, so this is a dumb line of questioning on my part.

Can you bundle up your test and attach it here?

--Ari
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi Ari,
I made it work.

First I rewrote my project, and in the TC configuration I only did the following:
*added the Lucene TIM
*declared the shared root (the RAMDirectory)
*instrumented one class (Directory; in fact it's an abstract class, but that doesn't matter)
Then it works. You are right about my TC config: I think I broke it by instrumenting some classes that are already covered by the TIM, which caused the application's strange behavior. A sketch of the resulting configuration is below.
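For reference, the working configuration boils down to something like this (a sketch of just the relevant stanzas; the root field name is from my earlier test class):

Code:
 <clients>
   <modules>
     <module name="clustered-lucene-2.0.0" version="2.6.1"/>
   </modules>
 </clients>
 <application>
   <dso>
     <instrumented-classes>
       <include>
         <class-expression>org.apache.lucene.store.Directory</class-expression>
       </include>
     </instrumented-classes>
     <roots>
       <root>
         <field-name>MyMain.idx</field-name>
       </root>
     </roots>
   </dso>
 </application>
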

But there is a problem for me with TC locking (I'll describe it at the end). So I tried switching to the Compass library: sharing a TerracottaDirectory instead of a RAMDirectory.

I think this new class is very well done, since it works with the latest version of Lucene (2.3.2), while the latest Lucene TIM targets Lucene 2.0.0.
In short, it also works. But the locking problem above is still there, and it lies with Lucene (by the way, I don't know Lucene well either; I've only been learning it since last weekend):

Here is the procedure to write some documents to an index:
*create a new IndexWriter
*add some text documents (one by one) to the writer
*optimize and close the writer

So if I want this whole procedure to be protected by a TC lock, I need to surround my code with the synchronized keyword. It would be something like this:

synchronized (myIndex) { // this becomes a TC write lock
    // create a new IndexWriter
    // add some text documents (one by one) to the writer
    // optimize and close the writer
}

The point is that I don't know in advance how many documents will be added; that depends on user interaction.
I would have to surround a block whose contents change interactively with the synchronized keyword. That seems hard, because once the TC write lock is taken (just before I create the IndexWriter), no other thread can enter the lock to add the new documents the first thread is waiting for.

So what I do is create a method whose argument is a document collection (a list).
The procedure of this method:

*build a document list (outside the lock)
synchronized (myIndex) { // this becomes a TC write lock
    // create a new IndexWriter
    // add the documents from the list to the writer
    // optimize and close the writer
}
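In Java, the method looks roughly like this (a sketch; MyTCDirectory.d is my shared TerracottaDirectory root from the attached code, and I assume an autolock in tc-config covers this method):

Code:
 import java.io.IOException;
 import java.util.List;

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.index.IndexWriter;

 public class BatchIndexer
 {
   // The documents are collected first, so the lock is held only for
   // the actual write, not while waiting on user interaction.
   public static void addDocuments(List<Document> docs) throws IOException
   {
     synchronized (MyTCDirectory.d) // becomes a TC write lock
     {
       IndexWriter writer = new IndexWriter(MyTCDirectory.d, new StandardAnalyzer(), false);
       for (Document doc : docs) {
         writer.addDocument(doc);
       }
       writer.optimize();
       writer.close();
     }
   }
 }
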

It seems that the problem is solved, but I'd like to know whether there is another (better) solution.

I attach my code here (with all the libraries needed). There are not a lot of comments in my code, but it's really simple to understand. I think it's a good example to illustrate sharing a TerracottaDirectory. All remarks are welcome.

Thanks for your help.
Chu
 Attachment: Clustering Lucence: sharing a TerracottaDirectory instance.zip (Code + Libraries, 3319 KB)

tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

I haven't looked at your code yet, but I am not sure I understand why you need this extra level of locking. From what I remember of working on the Compass integration with Shay, you shouldn't need this kind of thing.

If you take out that locking, do you run into a specific problem?

My other suggestion would be to ask for help on the Compass list. They might know better how to solve your problem in the context of Compass.

neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi Gautier,

tgautier wrote:
If you take out that locking, do you run into a specific problem?  


Yes, there is. For example, if you try to create two IndexWriter instances on the shared index with this code:
Code:
//MyTCDirectory.d is the clustered TerracottaDirectory instance 
 IndexWriter iw1 = new IndexWriter(MyTCDirectory.d, new StandardAnalyzer(), false);
 IndexWriter iw2 = new IndexWriter(MyTCDirectory.d, new StandardAnalyzer(), false);


You'll get an org.apache.lucene.store.LockObtainFailedException with the message: "Lock obtain timed out: TerracottaLock: write.lock class"
(this lock implementation comes from the Compass team, I believe).

This problem is classic, since Lucene always forbids having multiple writers on the same index; there can be only one at a time. That's why I was trying to share an IndexWriter instance, but I haven't succeeded, because it's difficult to do.
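One hypothetical way to cope (a sketch, not what I ended up doing) is to retry until the current writer is closed and Lucene's write.lock becomes free:

Code:
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.LockObtainFailedException;

 public class WriterRetry
 {
   // Polls until no other JVM holds the index's write lock.
   public static IndexWriter openWriter(Directory dir) throws Exception
   {
     while (true) {
       try {
         return new IndexWriter(dir, new StandardAnalyzer(), false);
       } catch (LockObtainFailedException e) {
         Thread.sleep(500); // another writer is open; back off and retry
       }
     }
   }
 }
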

tgautier wrote:
My other suggestion would be to ask for help on the Compass list. They might know better how to solve your problem in the context of Compass.  

I will do that.

By the way, I've just checked the attached code, and there are some problems with the libraries. I don't know why, but you can fix it by getting:
*lucene-core-2.3.2.jar from a Lucene binary package
*compass-2.0.1.jar and commons-logging.jar from a Compass binary package
and, in the TC config, modifying the TIM path to point at compass-2.0.1.jar, like this:
Code:
<modules>
       <repository>/home/neo/Desktop/compass-2.0.1/dist</repository>
       <module group-id="org.compass-project" name="compass" version="2.0.1"/>
     </modules>


Thanks for reading.
Chu
tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

Right, that makes sense. Would it make sense to enqueue the write requests and service them serially?

If you only need to lock/unlock within a single JVM, then you don't need TC. If you are doing this across JVMs, then TC is the right strategy.

You should look into ReentrantReadWriteLock if you want to be able to lock and unlock decoupled from the call stack. There's a recipe in the cookbook that demonstrates ReentrantReadWriteLock usage:

http://www.terracotta.org/confluence/display/howto/Recipe?recipe=rrwl
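The pattern is roughly this (a sketch; for clustering, the lock instance would be declared as a Terracotta root):

Code:
 import java.util.concurrent.locks.ReentrantReadWriteLock;

 public class IndexGuard
 {
   // Declared as a root so every JVM shares the same lock instance
   public static final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

   // Unlike synchronized, lock() and unlock() don't have to sit in the
   // same method or even the same call stack.
   public static void beginWrite() { rwl.writeLock().lock(); }
   public static void endWrite()   { rwl.writeLock().unlock(); }
 }
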
neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi Gautier,
As a matter of fact, I use the TC lock to make my application "thread-safe" across all the JVMs.
I've just got a response from Shay Banon on the Compass forum which clarifies a lot about the TerracottaDirectory: if I have 2 JVMs and each of them tries to create an IndexWriter, only one will succeed and the other will get an exception (I should have tested this before). That means, in the end, I don't need a TC lock to keep my application "thread-safe" inside the TC cluster.

About the ReentrantReadWriteLock example: I find it very useful. I tried it and it works smoothly too.

However, I still have a few questions:

gautier wrote:
Would it make sense to enqueue the write requests and service them serially?  

Of course it's a very good idea. But does TC implicitly support a queueing mechanism? I mean, when each of my JVMs requests the write lock at the same time, the TC server will give the lock to one JVM and make the rest wait. Then, when the lock is released, the TC server will "choose" one JVM to give the lock to. Am I right? And how does the TC server choose which JVM gets the lock?

One last question: when I use a ReentrantReadWriteLock (with TC enabled), what exactly does it lock?
I made a small experiment:
*One JVM holds a ReentrantReadWriteLock write lock.
*Another JVM reads the data without taking a ReentrantReadWriteLock read lock.
And it did read the "dirty" data. My hypothesis is that the write lock doesn't tell the cluster which data needs to be protected (contrary to the synchronized keyword). So I think I need to use the ReentrantReadWriteLock everywhere in my code to keep my data consistent.
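What I mean is something like this (a sketch, assuming the lock is a shared root and every reader goes through it):

Code:
 import java.util.concurrent.locks.ReentrantReadWriteLock;

 public class SafeReader
 {
   // The same shared lock instance that the writers use
   public static final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();

   public static void read(Runnable search)
   {
     rwl.readLock().lock(); // readers must participate too,
     try {                  // or they may see stale data
       search.run();
     } finally {
       rwl.readLock().unlock();
     }
   }
 }
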

Thanks in advance.
Chu
tgautier

seraphim

Joined: 06/05/2006 12:19:26
Messages: 1781
Offline

neochu19 wrote:
Of course it's a very good idea. But does TC implicitly support a queueing mechanism? I mean, when each of my JVMs requests the write lock at the same time, the TC server will give the lock to one JVM and make the rest wait. Then, when the lock is released, the TC server will "choose" one JVM to give the lock to. Am I right? And how does the TC server choose which JVM gets the lock?
 


No, it doesn't. If it makes sense to queue within a single JVM, then I would just use a ThreadPoolExecutor (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html) configured with just one thread max, which automatically sets up a queue and processes items serially (but asynchronously).
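Something like this (a sketch; the Executors factory method is just the shorthand for a one-thread ThreadPoolExecutor):

Code:
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;

 public class SerialIndexer
 {
   // One worker thread means submissions queue up and run one at a
   // time, while callers return immediately.
   private static final ExecutorService writes = Executors.newSingleThreadExecutor();

   public static void enqueue(Runnable writeTask)
   {
     writes.submit(writeTask);
   }
 }
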

If you want to do the same thing across JVMs, then you can cluster a queue using Terracotta. The cookbook has an example: http://www.terracotta.org/confluence/display/howto/Recipe?recipe=linkedblockingqueue
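A sketch of the cross-JVM variant (assumes the queue is declared as a Terracotta root and that exactly one JVM in the cluster runs the consumer, which is the only place a writer is ever opened):

Code:
 import java.util.concurrent.LinkedBlockingQueue;

 import org.apache.lucene.document.Document;

 public class ClusteredWriteQueue
 {
   // Declared as a Terracotta root, so every JVM sees the same queue
   public static final LinkedBlockingQueue<Document> queue =
       new LinkedBlockingQueue<Document>();

   // Run this in exactly one JVM to drain the queue serially.
   public static void consume() throws InterruptedException
   {
     while (true) {
       Document doc = queue.take(); // blocks until work arrives
       // open the IndexWriter, addDocument(doc), optimize/close as needed
     }
   }
 }
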

neochu19 wrote:
One last question: when I use a ReentrantReadWriteLock (with TC enabled), what exactly does it lock?
I made a small experiment:
*One JVM holds a ReentrantReadWriteLock write lock.
*Another JVM reads the data without taking a ReentrantReadWriteLock read lock.
And it did read the "dirty" data. My hypothesis is that the write lock doesn't tell the cluster which data needs to be protected (contrary to the synchronized keyword). So I think I need to use the ReentrantReadWriteLock everywhere in my code to keep my data consistent.


Dirty reads are possible with Terracotta; that doesn't really have anything to do with ReentrantReadWriteLock. I wrote up another example in the Cookbook to illustrate dirty reads:

http://www.terracotta.org/confluence/display/howto/Recipe?recipe=dirty-read

neochu19

journeyman

Joined: 06/05/2008 04:05:59
Messages: 26
Offline

Hi Gautier,
Thanks again for your helpful responses.
Chu
 