Terracotta Discussion Forums (LEGACY READ-ONLY ARCHIVE)

Hi,
I have a problem when the tc client gets lots of concurrent requests:
Code:

 2008-09-02 11:11:07,298 [TCComm Main Selector Thread] INFO com.tc.net.core.TCConnectionManager - error event on connection com.tc.net.core.TCConnectionJDK14@745028604: connected: true, closed: false local=127.0.0.1:27868 remote=127.0.0.1:9510 connect=[Tue Sep 02 10:55:57 CST 2008] idle=728ms [90549 read, 8598825 write]: Broken pipe

After that, the tc client disconnected from the tc server, and all the threads handling requests blocked waiting for responses from tc.

BTW, I am using terracotta-2.6.2 with the EHCache module.

I got this error again today:

Code:

 2008-09-02 19:34:10,555 [TCComm Main Selector Thread] INFO com.tc.net.core.TCConnectionManager - error event on connection com.tc.net.core.TCConnectionJDK14@1688936871: connected: true, closed: false local=192.168.0.216:62066 remote=192.168.0.216:9510 connect=[Mon Sep 01 11:49:53 CST 2008] idle=117ms [39108458 read, 623439915 write]: Broken pipe
 2008-09-02 19:34:11,248 [TCComm Main Selector Thread] WARN com.tc.net.core.CoreNIOServices - Exception trying to shutdown socket output: Transport endpoint is not connected

Ok. It will help if you explain the environment and situation in more detail.

I am using ehcache to cache contacts.

tc-config.xml:
Code:

 <?xml version="1.0" encoding="UTF-8"?>
 <con:tc-config xmlns:con="http://www.terracotta.org/config">
   <servers>
     <server host="127.0.0.1">
       <dso-port>9510</dso-port>
       <jmx-port>9520</jmx-port>
       <data>terracotta/server-data</data>
       <logs>terracotta/server-logs</logs>
       <dso>
         <persistence>
           <mode>permanent-store</mode>
         </persistence>
       </dso>
     </server>
   </servers>
   <clients>
     <logs>terracotta/client-logs</logs>
     <modules>
       <module name="tim-ehcache-1.3" version="1.1.1"/>
     </modules>
   </clients>
   <application>
     <dso>
       <instrumented-classes>
         <include>
           <class-expression>com.pqs.contact.Contact</class-expression>
           <honor-transient>true</honor-transient>
         </include>
       </instrumented-classes>
       <roots>
         <root>
           <field-name>com.pqs.contact.ContactProvider.cacheManager</field-name>
         </root>
       </roots>
       <locks>
         <autolock>
           <method-expression>void com.pgs.contact.ContactItem.*(..)</method-expression>
           <lock-level>write</lock-level>
         </autolock>
       </locks>
     </dso>
   </application>
 </con:tc-config>

source:

Code:

 
 package com.pqs.Contact;

 import net.sf.ehcache.Cache;
 import net.sf.ehcache.CacheManager;
 import net.sf.ehcache.Element;

 public class ContactProvider {
     public CacheManager cacheManager = new CacheManager();
     private Cache contactCache = null;

     /**
      * Returns the Contact for the given username.
      *
      * @param username the username to search for.
      * @return the contact associated with the ID.
      * @throws com.pqs.Contact.UserNotFoundException if the ID does not correspond
      *         to a known entity on the server.
      */
     public Contact getContact(String username) throws UserNotFoundException {
         if (contactCache == null) {
             cacheManager.addCache("contact");
             contactCache = cacheManager.getCache("contact");
         }
         if (contactCache == null) {
             throw new UserNotFoundException("Could not load caches");
         }
         Element contactEle = contactCache.get(username);

         if (contactEle == null) {
             Contact contact = new Roster(username);
             contactEle = new Element(username, contact);
             contactCache.put(contactEle);
         } else {
             System.out.println("getRoster from cache:" + username);
         }
         Contact contact = (Contact) contactEle.getValue();
         return contact;
     }
 }

When I started 1,000 users calling getContact() simultaneously, the errors above occurred.

Then I tried adding synchronized(cacheManager) { } around the code in getContact(), and it works smoothly now, but I want to know why this happened.

Modified source:

Code:

 package com.pqs.Contact;

 import net.sf.ehcache.Cache;
 import net.sf.ehcache.CacheManager;
 import net.sf.ehcache.Element;

 public class ContactProvider {
     public CacheManager cacheManager = new CacheManager();
     private Cache contactCache = null;

     /**
      * Returns the Contact for the given username.
      *
      * @param username the username to search for.
      * @return the contact associated with the ID.
      * @throws com.pqs.Contact.UserNotFoundException if the ID does not correspond
      *         to a known entity on the server.
      */
     public Contact getContact(String username) throws UserNotFoundException {
         synchronized (cacheManager) {
             if (contactCache == null) {
                 cacheManager.addCache("contact");
                 contactCache = cacheManager.getCache("contact");
             }
             if (contactCache == null) {
                 throw new UserNotFoundException("Could not load caches");
             }
             Element contactEle = contactCache.get(username);

             if (contactEle == null) {
                 Contact contact = new Roster(username);
                 contactEle = new Element(username, contact);
                 contactCache.put(contactEle);
             } else {
                 System.out.println("getRoster from cache:" + username);
             }
             Contact contact = (Contact) contactEle.getValue();
             return contact;
         }
     }
 }
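
Whatever is causing the broken pipe, one thing the synchronized block changes is that the lazy setup of contactCache stops being a check-then-act race: without it, two threads can both see contactCache == null and both try to add the "contact" cache. A minimal sketch of guarded one-time initialization, using only JDK classes, might look like the following. The class LazyInitSketch, its registry map, and runConcurrently are hypothetical stand-ins for illustration, not the Ehcache or Terracotta API, and the sketch says nothing about cluster-wide locking.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for ContactProvider's lazy cache setup (JDK classes only).
class LazyInitSketch {
    // Stand-in for the cache manager: cache name -> cache contents.
    private final Map<String, Map<String, String>> registry = new ConcurrentHashMap<>();
    private final AtomicInteger creations = new AtomicInteger();
    private Map<String, String> contactCache; // guarded by "this"

    // Only one thread at a time may run the check-then-create sequence.
    synchronized Map<String, String> getContactCache() {
        if (contactCache == null) {
            Map<String, String> created = new ConcurrentHashMap<>();
            // The stand-in registry rejects duplicates so a double init would be visible.
            if (registry.putIfAbsent("contact", created) != null) {
                throw new IllegalStateException("cache 'contact' already exists");
            }
            creations.incrementAndGet();
            contactCache = created;
        }
        return contactCache;
    }

    // Releases all workers at once and returns how many times the cache was created.
    static int runConcurrently(int threads) {
        LazyInitSketch sketch = new LazyInitSketch();
        CountDownLatch start = new CountDownLatch(1);
        AtomicInteger failures = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await();
                    sketch.getContactCache();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    failures.incrementAndGet();
                } catch (RuntimeException e) {
                    failures.incrementAndGet();
                }
            });
            workers[i].start();
        }
        start.countDown();
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        if (failures.get() != 0) {
            throw new IllegalStateException(failures.get() + " worker(s) failed");
        }
        return sketch.creations.get();
    }

    public static void main(String[] args) {
        System.out.println("cache created " + runConcurrently(50) + " time(s)");
    }
}
```

Because the check and the creation happen under one lock, the run should report a single creation no matter how many threads arrive together. This guards only the setup step; it does not by itself explain the disconnect from the tc server.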

How do you start the 1000 threads? On 1 JVM? On many?

You have a write lock on getContact(), so if you are starting the threads across JVMs, they will all contend with each other to acquire the write lock.
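
To illustrate why the lock level matters for a read-mostly workload, here is a minimal in-JVM sketch using the JDK's ReentrantReadWriteLock. It is only an analogy: Terracotta's clustered locks are configured in tc-config.xml and are not this class, and the names LockLevelSketch and maxConcurrentHolders are hypothetical. The sketch counts how many threads manage to hold the lock at the same moment under each lock level.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical in-JVM analogy; not Terracotta's clustered lock implementation.
class LockLevelSketch {
    // Returns the largest number of threads that held the lock at the same time.
    static int maxConcurrentHolders(boolean useWriteLock, int threads, long waitMillis) {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        Lock lock = useWriteLock ? rw.writeLock() : rw.readLock();
        AtomicInteger current = new AtomicInteger();
        AtomicInteger max = new AtomicInteger();
        CountDownLatch allInside = new CountDownLatch(threads);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                lock.lock();
                try {
                    int now = current.incrementAndGet();
                    max.accumulateAndGet(now, Math::max);
                    allInside.countDown();
                    // Give the other threads a chance to get in while we still hold the lock.
                    // Under a write lock they cannot, so this wait simply times out.
                    allInside.await(waitMillis, TimeUnit.MILLISECONDS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    current.decrementAndGet();
                    lock.unlock();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return max.get();
    }

    public static void main(String[] args) {
        System.out.println("read lock holders at once:  " + maxConcurrentHolders(false, 4, 2000));
        System.out.println("write lock holders at once: " + maxConcurrentHolders(true, 4, 100));
    }
}
```

With a write lock only one thread is ever inside, so readers of an already-loaded contact queue up behind each other; with a read lock they can proceed together.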

How do you generate load on this? How many of the 1K threads are getting a cache miss and having to generate a contact? And how many are simply reading an already-loaded contact?

--Ari

The request threads come from 2 other JVMs.
All requests simply read an already-loaded contact, which consists of a huge XML stanza.

What's your tc server doing when your client gets a broken pipe? Do your server logs show an OOME or assert and exit perchance?

Can you reproduce this, or do you still have the server logs? Please attach them here. I suspect you are doing something of the following sort:

1. You have too much parallel load hitting your application from each JVM (2 JVMs, each with 500 to 1000 threads, is what I think you are doing, no?)
2. You have not tuned your app to handle the payloads you are sending through Terracotta. You might be running out of memory, especially if your updates are large enough.

I also think that you are asking why your threads block when you get a broken pipe. The threads will block while a TC server is not available. This is why you always run 2 TC servers in active / passive mode in production. In your case, the connection between the client and the TC server gets severed, simulating a TC outage, and then all your threads trying to write will block. I wouldn't worry about this issue till we work out why you are getting the broken pipe.

--Ari

BTW, have you done the simple arithmetic of:

XML_Payload x number of parallel threads = total number of bytes sent / second to TC.

If that number is > 1 Gbit / second (about 125 MBytes / second), I seriously doubt your test will ever succeed till you bring more machines into the mix. There are ways to get more than 1 Gbit / sec, but it will take much more work than this simple test I think you are running.

What is that arithmetic in your test, please? KBytes / sec? MBytes / sec? Gigabytes / sec?

--Ari
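
As a rough illustration of the arithmetic asked for above, here is a minimal sketch. The payload size (100,000 bytes) and the rate of one request per thread per second are assumptions made up for the example, since the thread does not state them; the thread count of 2,000 takes the upper guess of 2 JVMs with 1000 threads each. The class and method names are hypothetical.

```java
// Hypothetical helper for the "payload x threads" estimate; all inputs are assumed values.
class ThroughputEstimate {
    static final long ONE_GIGABIT_IN_BITS = 1_000_000_000L;

    // Total bytes per second sent to the TC server under the stated assumptions.
    static long bytesPerSecond(long payloadBytes, int threads, int requestsPerThreadPerSecond) {
        return payloadBytes * threads * requestsPerThreadPerSecond;
    }

    // True if the byte rate, converted to bits, is above 1 Gbit per second.
    static boolean exceedsOneGigabit(long bytesPerSecond) {
        return bytesPerSecond * 8 > ONE_GIGABIT_IN_BITS;
    }

    public static void main(String[] args) {
        long payloadBytes = 100_000L;        // assumed size of one XML stanza
        int threads = 2_000;                 // 2 JVMs x 1000 threads (upper guess)
        int requestsPerThreadPerSecond = 1;  // assumed request rate

        long rate = bytesPerSecond(payloadBytes, threads, requestsPerThreadPerSecond);
        System.out.println("bytes per second: " + rate);
        System.out.println("bits per second:  " + (rate * 8));
        System.out.println("over 1 Gbit/s:    " + exceedsOneGigabit(rate));
    }
}
```

With these assumed numbers the load comes to 200,000,000 bytes per second, which is 1.6 Gbit per second and already over the limit; a 10,000-byte payload with 1,000 threads would be 80 Mbit per second and well under it.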