Sunday, October 23, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 2)

Part 2: Autocomplete with YUI3 and GAE RPC Handlers


This is the second and final part of a two-part series on implementing AutoComplete with GAE. Recall that in part 1 we built the foundation for a keyword autocomplete lookup service for our document search application. Both the service itself and its HTML/JavaScript client materialize below.

Let's switch for a moment to the JavaScript side, where I intend to use YUI3 AutoComplete. It supports a variety of sources to query for auto-complete choices, including XHR (XMLHttpRequest) style and JSONP style URL sources. While working within the bounds of the same application an XHR URL will suffice, but the simplicity of both the YUI widget and GAE RPC service support lets us do both with almost no extra work. Keep in mind that a JSONP service allows access from third-party web pages, which should not be taken lightly for security reasons.

The choice of the YUI3 widget over other libraries, such as jQuery with its Autocomplete plugin, is not important, as one can swap plugins with a few lines of JavaScript. The YUI3 library offers a rich set of built-in options and a wide variety of compatible widgets, utilities and infrastructure, and its API now resembles jQuery (I believe Yahoo stole - not copied - according to Picasso).

A great article by Paul Peavyhouse contains the building blocks for RPC handlers in GAE. We begin with the RPCHandler class:

class RPCHandler(webapp.RequestHandler):
    """ Allows the functions defined in the RPCMethods class to be RPCed."""

    def __init__(self):
        webapp.RequestHandler.__init__(self)
        self.methods = RPCMethods()

    def get(self):
        func = None

        action = self.request.get('action')
        if action:
            if action[0] == '_':
                self.error(403) # access denied
                return
            else:
                func = getattr(self.methods, action, None)

        if not func:
            self.error(404) # file not found
            return        

        args = ()
        while True:
            key = 'arg%d' % len(args)
            val = self.request.get(key)
            if val:
                args += (val,)
            else:
                break
            
        # Checking if result is cached
        cache_key = action + ';' + ';'.join(args)
        result = memcache.get(cache_key)
        
        # Query if it's not
        if result is None:
            result = func(*args)
            memcache.add(cache_key, result, 900) # cache for 15 minutes
            
        return_data = self.prepare_result(result)
        self.response.out.write(return_data)
        self.response.headers['Content-Type'] = "application/json"

RPCHandler takes care of roughly 90% of the job (line numbers refer to the listing above, counting from its first line):
  • it retrieves the action from the request and matches it to the appropriate RPC method of the RPCMethods class via reflection (lines 4-6 and 9-21)
  • it extracts service parameters from the request (parameter names matching argN) to pass to the RPC method (lines 23-30)
  • it forms a key to cache this call and checks whether the result is already available from memcache (lines 32-34)
  • it calls the RPC method and saves the result in the cache (lines 36-39)
  • it formats the result and sends it back to the client (lines 41-43)
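
The argument collection and cache key steps can be tried outside App Engine. The sketch below is plain Python with no SDK involved; collect_args and make_cache_key are hypothetical helper names, and a plain dict stands in for the request parameters.

```python
def collect_args(params):
    """Collect arg0, arg1, ... from a dict of request parameters.

    Stops at the first missing or empty argN, as the handler's loop does.
    """
    args = ()
    while True:
        val = params.get('arg%d' % len(args))
        if val:
            args += (val,)
        else:
            break
    return args


def make_cache_key(action, args):
    """Build the cache key: the action name followed by its arguments."""
    return action + ';' + ';'.join(args)


# Hypothetical request: arg2 is missing, so arg3 is never reached.
params = {'action': 'ac_keywords', 'arg0': 'goo', 'arg1': '10', 'arg3': 'ignored'}
args = collect_args(params)
key = make_cache_key(params['action'], args)
print(args)  # ('goo', '10')
print(key)   # ac_keywords;goo;10
```

Because the key is built from the action and every argument, the same prefix requested with a different limit is cached separately.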

RPCHandler is an abstract class: concrete handlers extend it using the template method pattern. A single abstract method, prepare_result, lets us have both XHR and JSONP style handlers:

class JSONPHandler(RPCHandler):
    
    def prepare_result(self, result):
        callback_name = self.request.get('callback')
        json_data = simplejson.dumps(result)
        return_data = callback_name + '(' + json_data + ');'
        return return_data
    
class XHRHandler(RPCHandler):
    
    def prepare_result(self, result):
        json_data = simplejson.dumps(result)
        return json_data

While XHRHandler formats data as JSON, JSONPHandler additionally wraps the generated JSON in a call to the callback function, as a JSONP client expects. The simplejson encoder used here ships with Django and is imported from django.utils, which is part of the App Engine environment.
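
To see the difference between the two reply formats, here is a minimal sketch using the standard json module instead of simplejson. The function names to_xhr and to_jsonp are hypothetical, and the callback name check is my own addition (the handlers above do not validate the callback); it is one possible precaution given that a JSONP URL is reachable from third-party pages.

```python
import json
import re

# Assumption: accept only identifier-like callback names, dots allowed.
CALLBACK_RE = re.compile(
    r'^[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*$')


def to_xhr(result):
    """XHR style: the reply body is plain JSON."""
    return json.dumps(result)


def to_jsonp(result, callback_name):
    """JSONP style: the same JSON wrapped in a call to the callback."""
    if not CALLBACK_RE.match(callback_name):
        raise ValueError('invalid callback name: %r' % callback_name)
    return callback_name + '(' + json.dumps(result) + ');'


print(to_xhr(['google', 'goose']))          # ["google", "goose"]
print(to_jsonp(['google', 'goose'], 'cb'))  # cb(["google", "goose"]);
```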

With the RPC plumbing done, the RPCMethods class does the actual work: its method for the keyword autocomplete action is ac_keywords (later you can offer more services by adding methods to RPCMethods):

class RPCMethods:
    
    def ac_keywords(self, *args):
        prefix = args[0]
        limit = int(args[1])
        
        query = Query(Keyword, keys_only=True)        
        query.filter('words >=', prefix)
        query.filter('words <=', unicode(prefix) + u"\ufffd")
        
        keyword_keys = query.fetch(limit, 0)
        result = map(lambda x: x.name(), keyword_keys) 
        return result
The method ac_keywords executes a search that matches all keywords starting with prefix and returns the normalized version of each matching keyword, taken from the retrieved key. In the first part we called this approach embedded RIE for exactly this reason: the key is retrieved as data by searching over a string list property.
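
The prefix range trick can be simulated in memory. This is a sketch in plain Python 3, not datastore code: the dict below is hypothetical data mirroring the Keyword entity (key name mapped to its words list), and matches_prefix and ac_keywords_sim are illustrative names. An entity matches when at least one of its words satisfies both inequalities.

```python
def matches_prefix(words, prefix):
    """True if any single word falls in the range [prefix, prefix + U+FFFD]."""
    upper = prefix + u"\ufffd"
    return any(prefix <= w <= upper for w in words)


# Hypothetical data: key name (normalized keyword) -> words list.
keywords = {
    u"google": [u"google", u"Google"],
    u"goose": [u"goose"],
    u"python": [u"python", u"Python"],
}


def ac_keywords_sim(prefix, limit):
    """Return key names of matching entities; sorted only for stable output."""
    names = sorted(name for name, words in keywords.items()
                   if matches_prefix(words, prefix))
    return names[:limit]


print(ac_keywords_sim(u"go", 10))  # ['google', 'goose']
print(ac_keywords_sim(u"Go", 10))  # ['google']
```

Note how the prefix Go matches through the non-normalized word Google, yet the result is the normalized key name.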
Now that everything is ready on the GAE side (well, almost: I left the last piece of code for the very end), we can take care of the HTML inside the browser. I start by defining a form containing an input field to enter keywords:
<form method="post" action="#" id="search_form">
<p>
<input id="search_field" class="search" type="text" name="search_field" 
               value="Enter keywords..." 
               onfocus="if(!this._haschanged){this.value=''};this._haschanged=true;"/>
        <input name="search" type="image" src="images/search.png" 
               alt="Search" title="Search" />
    </p>
</form>
The new empty input box contains the phrase Enter keywords... that disappears as soon as the user focuses on the field:

With auto-complete enabled it will look like this:


Adding the YUI3 AutoComplete plugin takes just a few lines of JavaScript, which also include extra customization to control highlighting, filtering, and the delimiter for matching words (far from all the options available to tune this plugin for one's needs):

...

I used queryDelimiter to activate autocomplete for each word the user enters: feel free to play with these and other attributes the plugin offers. The commented-out source line defines the source URL for the JSONP style service, while the active source line is for the XHR URL.

Finally, the last piece of server-side Python code enables both URLs using the webapp framework for Python (in main.py):

application = webapp.WSGIApplication(
            [( '/',                 MainHandler),
             ...
             ( '/rpc.jsonp',        JSONPHandler),
             ( '/rpc.xhr',          XHRHandler)
            ])
    
run_wsgi_app(application)
 

I want to finish by emphasizing how efficient this solution is. The YUI3 AutoComplete plugin automatically caches responses from JSONP and XHR URL sources, keyed on the query value, for the duration of the pageview. The Python RPC services implemented in GAE use memcache automatically and transparently to cache results for each action type and query value. Finally, querying for matching keywords in the datastore uses keys-only queries, which are the least expensive. Given that an autocomplete feature on a busy site is likely to be heavily used, all of these contribute to both performance and savings on GAE.

Friday, October 21, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 1)

Part 1: Embedded Relation Index Entity


I continue the series of posts about keyword-based searches on Google App Engine (GAE) using Relation Index Entity (see the RIE with Java and RIE with Python posts, in this order). Having implemented efficient search on GAE, let's switch the focus to usability. When a user searches for a word that is not one of the indexed keywords, our search yields no results. To help users search for documents more efficiently we can introduce the auto-complete pattern, which looks like this in a browser:


Better usability also reduces the number of RIE searches that yield no results (though not to zero, since a user can still ignore autocomplete and enter arbitrary words), which helps the GAE bill. It is a win-win if we do it right.

First, let's build a foundation: a searchable list of all keywords. The existing RIE is of limited use as it is designed to search for documents by keywords, not for keywords themselves. Thus we need a new entity to store unique keywords:

class Keyword(db.Model):
    keyword = db.StringProperty()

Let's plug it in where we build document RIE:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)
            
for keyword in keywords:
    keyword_entity = Keyword(key_name=keyword.lower(), keyword=keyword.lower())
    keyword_entity.put()

Compare this to the original post to notice additional return value for keywords in add_document:

def add_document(title, authors, publisher, tags):
     
    # the same code
     
    return (doc, keywords)

The code that stores keywords in the Keyword entity is not optimized, as it saves existing keywords over and over again: you may want to improve it by reading the entity first or by using the memcache caching system.

The Keyword entity is very simple, but it is worth noting that it has a key name (not an id) equal to the normalized keyword. The only string property it has is the normalized version of the keyword - the one that is used in all searches. The normalization we use is just lower-casing, while a more robust version would feature Unicode normalization (e.g. removing accents), replacement of standard abbreviations (such as St.), stripping of characters (such as ' or -) and even synonyms.

Unless the user enters the normalized version of a keyword, autocomplete will display nothing. The simplest approach is to normalize the prefix string before querying, but I chose a different solution (partially for demonstration purposes, but also because it takes care of arbitrary normalization algorithms). The solution is to use Keyword as an embedded Relation Index Entity: its key name holds the data and its field is the index (in a standard RIE the data is in the parent entity and the index is a child entity, remember Document and DocumentKeywords?). This change should go a long way when we introduce more elaborate normalization algorithms, as the number of words that normalize down to the same keyword will grow. So the Keyword entity gets its own StringListProperty to store the non-normalized words corresponding to the same keyword (plus the normalized version, of course):

class Keyword(db.Model):
    words = db.StringListProperty()

and is populated like this:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)
            
for keyword in keywords:
    normalized = normalize_keyword(keyword)
    keyword_entity = Keyword.get_by_key_name(normalized)

    if keyword_entity is None: # new keyword entity
        # built-in set removes the duplicate when keyword is already normalized
        keyword_entity = Keyword(key_name=normalized,
                                 words=list(set([normalized, keyword])))
    else:
        if keyword not in keyword_entity.words:
            keyword_entity.words.append(keyword)
        else:
            keyword_entity = None # no update necessary

    if keyword_entity is not None: # save new or updated keyword
        keyword_entity.put()

Our normalization is still the same, but it is factored out as it is expected to get more complex with time:

def normalize_keyword(keyword):
    return keyword.lower() 
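
As an illustration of where this function might go, here is a sketch of a somewhat more robust normalization using only the standard library. It covers lower-casing, accent removal and stripping of ' and - characters; abbreviations and synonyms are left out. The name normalize_keyword_extended is hypothetical and the choice of NFKD decomposition is my assumption, not something the post prescribes.

```python
import unicodedata


def normalize_keyword_extended(keyword):
    """Lower-case, remove accents, and strip apostrophes and hyphens."""
    text = keyword.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    text = u''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Strip characters that users often omit or type inconsistently.
    for ch in u"'-":
        text = text.replace(ch, u'')
    return text


print(normalize_keyword_extended(u"Caf\u00e9"))    # cafe
print(normalize_keyword_extended(u"O'Reilly"))     # oreilly
print(normalize_keyword_extended(u"Wi-Fi"))        # wifi
```

With a normalization like this, several spellings collapse to one key name, which is exactly the case the words list property is meant to handle.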

So how would we search keywords given a few characters of user input? It would look something like this:

query = Query(Keyword, keys_only=True)        
query.filter('words >=', term)
query.filter('words <=', unicode(term) + u"\ufffd")
        
keyword_keys = query.fetch(20, 0)
result = map(lambda x: x.name(), keyword_keys) 
At this point we have keywords ready to be served to an autocomplete plugin on the client. In part 2 we will take care of the browser with the YUI3 AutoComplete plugin and AJAX for both XHR and JSONP URL style RPC services, sprinkled with memcache.

Friday, September 23, 2011

One more time: Relation Index Entities with Python for Google Datastore

Hey, why not? Relation Index Entities implemented in Python should nicely complement my RIE with Java/Objectify post. Because I follow the same example and concepts I will not go over them twice but rather concentrate on specifics of Python example (again, please see RIE with Java/Objectify post for general concepts).

I start with defining datastore entities in Python:
class Document(db.Model):
    title = db.StringProperty()
    abstract = db.StringProperty()
    authors = db.StringListProperty()
    publisher = db.StringProperty()
    tags = db.StringListProperty()
    

class DocumentKeywords(db.Model):
    keywords = db.StringListProperty()

As before, the list property keywords of DocumentKeywords is the critical element of the RIE design. The brevity of the Python implementation is striking when comparing entity definitions with the Java/Objectify version. For example, there is nothing that indicates these 2 entities comprise an entity group. The entity classes as defined may or may not belong to the same entity group: the optional parent-child relationship is established when concrete instances are constructed.

Let’s see how we could add new Document with Python:

def add_document(title, authors, publisher, tags):
    
    doc = Document()
    doc.title = title
    doc.authors = authors
    doc.publisher = publisher
    doc.tags = tags
    
    doc.put()
    
    keywords = []
    keywords.append(doc.title)
    keywords.extend(doc.authors)
    keywords.append(doc.publisher)
    keywords.extend(doc.tags)
    
    doc_keywords = DocumentKeywords(parent=doc, keywords=keywords)
    doc_keywords.put() 
    
    return doc


By constructing DocumentKeywords with the parent argument we make it part of the document's entity group, so we can call add_document within a transaction:

doc = db.run_in_transaction(add_document, title, authors, publisher, tags)
  

Of course I oversimplified this example, as real keywords would be treated more carefully with consistent case, stop word filtering, robust parsing, etc. But this is not the goal of this exercise, so I am leaving it all out.

Finally, the keyword search method:

def find_by_keywords(keywords):
        
        query = reduce(lambda q, kw: q.filter('keywords =', kw), keywords,
                       db.Query(DocumentKeywords, keys_only=True))
        
        keywords_keys = query.fetch(100, 0)
        
        doc_keys = map(lambda x:x.parent(), keywords_keys)
        docs = db.get(doc_keys)
        
        return docs

Again, as in the Java/Objectify version, it is a 3-step process. The first step (lines 3-4) builds a query by reducing the list of keywords onto a query object (AND condition only). The query is keys-only, as we are never interested in anything else from DocumentKeywords. The second step (line 6) uses the query to retrieve the keys (keys of DocumentKeywords, that is). And lastly (lines 8-9) we map the retrieved keys to parent Document keys and retrieve the desired documents with a batch get.
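
The same three steps can be followed on in-memory data. The sketch below is plain Python 3 with no datastore: the index and documents dicts are hypothetical stand-ins for DocumentKeywords and Document, and find_by_keywords_sim is an illustrative name. It folds the keywords with reduce and an explicit initial value, so every keyword narrows the candidate set (AND condition).

```python
from functools import reduce

# Hypothetical index: DocumentKeywords key -> (parent Document key, keywords).
index = {
    'dk1': ('doc1', ['gae', 'python', 'search']),
    'dk2': ('doc2', ['gae', 'java']),
    'dk3': ('doc3', ['python', 'search']),
}
documents = {
    'doc1': 'RIE with Python',
    'doc2': 'RIE with Java',
    'doc3': 'Search basics',
}


def find_by_keywords_sim(keywords):
    # Step 1: fold all keywords into one AND condition over the index keys.
    matching = reduce(
        lambda keys, kw: [k for k in keys if kw in index[k][1]],
        keywords,
        sorted(index))
    # Step 2: 'matching' now holds index keys only, like a keys-only fetch.
    # Step 3: map to parent keys and fetch the documents in one batch.
    doc_keys = [index[k][0] for k in matching]
    return [documents[k] for k in doc_keys]


print(find_by_keywords_sim(['gae', 'python']))  # ['RIE with Python']
print(find_by_keywords_sim(['python']))         # ['RIE with Python', 'Search basics']
```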

I find this Python solution more elegant and concise than the Java/Objectify code. And it is clear that the Java code is saved by Objectify, especially the query part, where all datastore API features are clearly defined and easily available.

Some questions that arise when discussing the Relation Index Entity (RIE) pattern concern its feasibility in real applications. Performance issues associated with creating and maintaining such an index are common ones, as is its applicability in modeling data (for example see this post by Jeff Schnitzer of Objectify fame).

In my view the effectiveness of RIE depends on how this index is used (just like any index in general). If reads outnumber updates by a factor of 100 or more then it could be feasible. The higher the ratio the better. Thus for mostly static data, such as a document library or other reference material, near free-text search implemented with RIE is definitely an option to consider. Of course the ultimate test should be the GAE bill, which reflects resources spent on both queries and updates, and how alternatives compare against this bill.

The other point is that RIE should not play any part in the data model. Effectively it is an extension of indexes generated by datastore and should be treated as such.

And lastly, there is clearly a high initial cost of creating the RIE index for datastore artifacts (like documents). I recommend using task queues, mapreduce and/or blobstore, and doing as much pre-processing locally as possible before loading data into GAE. For example, document keywords can be extracted and processed on your own server before uploading them to the datastore. That way you can end up with a simple CSV file to process on GAE.
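
A local pre-processing step of this kind could look like the following sketch (plain Python 3). The function name keywords_to_csv, the input tuples and the row layout (document id followed by its keywords) are all assumptions made for illustration; real keyword extraction would be more careful than splitting the title on whitespace.

```python
import csv
import io


def keywords_to_csv(docs):
    """Write one CSV row per document: id, then its sorted unique keywords."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    for doc_id, title, tags in docs:
        words = set(w.lower() for w in title.split())
        words |= set(t.lower() for t in tags)
        writer.writerow([doc_id] + sorted(words))
    return out.getvalue()


# Hypothetical input: (document id, title, tags).
docs = [('doc1', 'Relation Index Entities', ['GAE', 'python'])]
print(keywords_to_csv(docs))  # doc1,entities,gae,index,python,relation
```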

Thursday, September 1, 2011

Groovy. Batch. Prepared statement. Nice!

Scripting with Groovy is an exciting thing. You get a feeling that the language was inspired by an Oracle who read your mind and then made it 10 times better. So imagine how I felt after finding out that Groovy can NOT batch prepared statements.

Batching SQL updates, inserts or deletes is one of the top features that database scripts need. Without prepared statements I had to resort to generating SQL with GString:
sql.withBatch { stmt ->
          mymap.each { k,v ->
              stmt.addBatch("""UPDATE some_table 
                                  SET some_column = '${v}' 
                                WHERE id = ${k} """)
          }
}
Besides SQL injection this presents the problem of escaping strings in SQL: a big pain in some cases. Even if you argue that injection is not an issue for an internal script (it's not out in the wild on the web, after all) you leave yourself with a loophole anyway. And don't forget about performance. The bottom line: I need support for prepared statements!

I would have had to stop here, joining the ranks of complaints like this, if not for Groovy 1.8.1. This latest stable version (as of today) addresses a bunch of bugs and just a couple of features. And one of the two is batch support for prepared statements. Below is a secure and reliable (as well as more readable) version with batch support for a prepared statement in 1.8.1:
sql.withBatch(20, """UPDATE some_table 
                        SET some_column = ? 
                      WHERE id = ? """) { ps ->   
              
          mymap.each { k,v ->
              ps.addBatch(v, k)
          }
}

You can find more options on how to use batching with prepared statements in the Groovy 1.8.1 docs.
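
For readers who want to try the idea without a Groovy setup, the same pattern (one parameterized statement, many parameter sets) can be sketched in Python with the standard sqlite3 module. This is an analogy, not the Groovy API: the in-memory database and the sample rows are made up, and only the table and column names follow the snippets above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE some_table (id INTEGER PRIMARY KEY, some_column TEXT)')
conn.executemany('INSERT INTO some_table (id, some_column) VALUES (?, ?)',
                 [(1, 'a'), (2, 'b')])

# A value with a quote in it would break naive string interpolation.
mymap = {1: "O'Brien", 2: 'x'}

# One statement with placeholders, executed once per parameter pair.
conn.executemany('UPDATE some_table SET some_column = ? WHERE id = ?',
                 [(v, k) for k, v in mymap.items()])
conn.commit()

rows = conn.execute(
    'SELECT id, some_column FROM some_table ORDER BY id').fetchall()
print(rows)  # [(1, "O'Brien"), (2, 'x')]
```

The quote in O'Brien needs no escaping because the value never becomes part of the SQL text.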

Friday, August 5, 2011

Java Anti-Pattern: Constructors and Template Method Pattern

My love for the template method pattern hit the wall. The wall of Java constructors, that is. Do not mix them together, ever, or at least not without double checking.

Again, I am all for template method pattern when it operates on fully initialized objects. But, by definition, Java constructors do not operate on such objects: they are exactly in the business of initializing them.

Imagine that your template method calls on an object that is still under construction - not something proponents of template method had in mind. But this is exactly what happens with Java constructors. Let's go straight to the example.

Suppose we have an abstract class Vehicle:
public abstract class Vehicle {
    
    private boolean registered = false;

    public Vehicle() {
        registered = registerWithDMV(getMileage());
    }
    
    public abstract int getMileage();
    
    private boolean registerWithDMV(int mileage) {
        return (registered = (mileage > 0) ? true : false);
    }
    
    public boolean isRegistered() {
        return registered;
    }
}
Every vehicle registers with the DMV when created. Its constructor (using the template method pattern) calls the concrete method registerWithDMV and the abstract method getMileage. So concrete subclasses of Vehicle must provide the mileage:
public class Car extends Vehicle { 
    
    private int mileage = 6;

    @Override
    public int getMileage() {
        return mileage;
    }
}
Our implementation for DMV registration is part of Vehicle for demonstration only. It is simple: give me non-zero mileage and you are good to go (drive that is). But alas, look at this test that promptly fails:
public class VehicleTests {
    
    @Test
    public void testNewCar() {
        Vehicle car = new Car();
        assertTrue(car.isRegistered());
    }
}
Unfortunately our template method (the Vehicle default constructor) runs when the Car object is not yet fully initialized: Java constructors run in order from higher in the hierarchy (abstract) to lower (concrete) classes, and a subclass's field initializers run only after the superclass constructor returns. That is why the field mileage is still 0 and not 6 when the Vehicle constructor runs.

My first recommendation is not to use the template method pattern in Java constructors at all. This is rather drastic but doable. Replace it with an init method that concrete classes call upon creation. If you don't like radical approaches then use lazy initialization and/or static variables in concrete classes:
public class Car extends Vehicle {
    
    private static final int INITIAL_MILEAGE = 6;
    private int mileage;

    @Override
    public int getMileage() {
        if (mileage == 0) {
            mileage = INITIAL_MILEAGE;
        }
        return mileage;
    }
}
The test succeeds now. But lazy initialization may need more work: if mileage 0 is a valid value you would end up introducing yet another field (a flag) to indicate whether mileage is initialized or not.

Again, my preference is to avoid this conflict altogether by placing object initialization in the constructor of the concrete class. Just think of Java constructors as non-polymorphic hierarchical artifacts.

Wednesday, July 13, 2011

Quick Start with Mercurial and Bitbucket Hosting

I started my project rather unprepared - no version control, no hosting. But eventually I did my homework. I am going to use the Mercurial DVCS (distributed version control system) and Bitbucket to host the Mercurial (Hg) repository. I am quite confident that if you choose Google Code or another hosting service for your Hg repo you can adapt the steps below with minor changes.

Let's call our project demo. It sits in a folder demo on my local drive. All I want is to enable Mercurial and start syncing it with a Bitbucket repository. To accomplish that we need just a few steps.

  1. Install Mercurial. Make sure the Mercurial command hg is on your PATH.
  2. Go to your project folder demo and run from command line:
    hg init
    You have initialized a brand new Mercurial repository that will contain the files from the demo project in a couple of steps.
  3. Create .hgignore file in demo folder to prevent Mercurial from adding files that don't belong to source control: build (or target with maven) folder, compiled classes, etc. For example see here.
  4. From command line in demo folder run:
    hg add
    You added all your files to Mercurial repository with exception of those identified by .hgignore.
  5. From command line in demo folder run:
    hg commit
    This is actual commit - without it those files that you added in the step before are just placeholders.
  6. We are done with initializing the new Mercurial repository. If you don't want any hosting then you are done - your project is a Mercurial repository now. I recommend reading this to get started with Mercurial.

  7. Register on Bitbucket (Suppose you registered using name myusername).
  8. Create new repository and call it demo. This is your hosted Mercurial repository that you can access with url https://bitbucket.org/myusername/demo. Check with Bitbucket for exact URL.
  9. Update your remote repository (which is empty) with your existing demo files, from command line in demo folder run:
    hg push https://bitbucket.org/myusername/demo
  10. Save a push URL so that you don't need to enter it each time when you use hg push or hg outgoing commands. Locate hgrc file in demo/.hg folder (create if it doesn't exist) and add:
    [paths]
    default-push = https://myusername@bitbucket.org/myusername/demo
    

You've done it! Your lonely project just became both hosted (source code) and backed by enterprise-strength DVCS.

Sunday, June 19, 2011

Enhancing JUnit Suites with Categories to eliminate test-suite dependency

With JUnit 3 the extra step of creating and maintaining test suites on top of unit tests never felt right. With JUnit 4 test suites became just simple annotation boilerplate:
@RunWith(Suite.class)
@Suite.SuiteClasses({
  SomeTests.class,
  SomeOtherTests.class,
  SomethingElseTests.class
})
public class SomeTestSuite {
  // this class is just a place holder for test suite annotations above
}

Next logical step would be aggregation of tests based on the information contained in tests themselves. Instead of specifying concrete tests, test suites would contain qualifiers (using annotations) to match. Those tests having matching qualifiers are included in a suite, those without are not. For example I have a qualifier JMSTest that is obviously assigned to tests that use a JMS provider.

Thus, my test suite classes will become completely decoupled from the tests and vice versa. This is actually even better than it sounds: even though there is no language dependency from tests to test suites in JUnit, a functional dependency does exist: tests will not run if they are not bound to one or more test suites.

With the introduction of JUnit Categories we received qualifier support that is almost what we need:

public interface JMSTest {}


@RunWith(Categories.class)
@Categories.IncludeCategory(JMSTest.class)
@Suite.SuiteClasses({
  SomeTestSuite.class,
})
public class JMSTestSuite {}


@Category(IntegrationTest.class)
public class SomeTests {
   ....
}


@Category(JMSTest.class) 
public class SomeOtherTests {
   ....
}


@Category(DatabaseTest.class) 
public class SomethingElseTests {
   ....
}

This setup will run SomeOtherTests marked with JMSTest category when running JMSTestSuite. But categories didn’t eliminate the dependency from suites to tests – we still depend on test suite SomeTestSuite that explicitly references our tests.

Now, imagine you can define a suite AllProjectTests that always contains all tests from the project. Then you can define category-based test suites like JMSTestSuite above and never care about maintaining your test suites again. Fortunately, we can use an open source project to do just that - ClasspathSuite:

import org.junit.extensions.cpsuite.ClasspathSuite;
import org.junit.runner.RunWith;
@RunWith(ClasspathSuite.class)
public class AllProjectTests {}


@RunWith(Categories.class)
@Categories.IncludeCategory(JMSTest.class)
@Suite.SuiteClasses({
  AllProjectTests.class,
})
public class JMSTestSuite {}

To summarize, if you want to define a test suite that runs all database-based tests:
1. Define AllProjectTests suite using ClasspathSuite.
2. Define JUnit category DatabaseTest.
3. Define corresponding suite DatabaseTestSuite to run all tests marked with category DatabaseTest.
4. More complex category-based suites are easy to construct with JUnit category support.

Wednesday, June 8, 2011

Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)

Extracting and processing text from multiple sources (file formats) is the job Apache Tika does quite well. It abstracts you away from format internals and Tika's coverage (pdf, MS Office, graphics, audio, video, etc.) is superb. Tika doesn't implement actual parsers - instead it offers uniform API to access other parsers for supported document types (all you need is to implement SAX parser ContentHandler - see here). Indeed Tika utilizes PDFBox internally for pdf files. Nothing prevents you from concentrating on parsing extracted text and converting it to information. Except when little details begin to matter.

Sometimes extracting and processing text depends on the order of lines in the document: for example, headers carry additional information for the line items below them. By default, the Tika parser (actually the PDFBox parser that does the work in the case of pdf) will not keep the order when stripping the text out (see org.apache.pdfbox.util.PDFTextStripper and its property sortByPosition). Thus, some headers may be fed to a handler after their line items. Apache PDFBox explains that performance is better when order is not preserved. Tika as of version 0.9 doesn't let you control this behavior in PDFBox (they plan to address it in the 1.0 release - see TIKA-100 and TIKA-612).

A simple patch lets us take advantage of the sortByPosition property in PDFBox's PDFTextStripper when using it with Tika: lines 6-10 below replace the hard-coded setSortByPosition(true or false); call. Remember that PDF2XHTML (from Tika) extends PDFTextStripper (from PDFBox):
private PDF2XHTML(ContentHandler handler, Metadata metadata)
            throws IOException {
        this.handler = new XHTMLContentHandler(handler, metadata);
        setForceParsing(true);
        
        // CUSTOM CODE:
        String sortEnabled = metadata.get("org.apache.tika.parser.pdf.sortbyposition");
        if (sortEnabled != null) {
            setSortByPosition(sortEnabled.equalsIgnoreCase("true"));
        }
}
Now you can control the order in the pdf parser as follows (there are no PDFBox classes in this code):
        
        Parser parser = new PDFParser();
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        metadata.set("org.apache.tika.parser.pdf.sortbyposition", "false");
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }

Of course, if you deal only with single file type (e.g. pdf) then it's easier to use dedicated library such as Apache PDFBox. Then my recommendation would be downloading Tika source code for real examples of PDFBox in action.

UPDATE
The issue is still present in Tika 1.0 with hard-coded setSortByPosition(false); in org.apache.tika.parser.pdf.PDF2XHTML.

The upcoming 1.1 release will add the PDFParser.setSortByPosition method, so the patch would be replaced with this:
      
        PDFParser parser = new PDFParser(); // concrete type: the Parser interface has no setSortByPosition
        parser.setSortByPosition(true); // or false
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }

Thursday, March 3, 2011

Persistence with JPA and Hibernate using Guice 3.0 and guice-persist

Why do I like Google frameworks? Because they are academic-like: elegant, concise, focused, open to extension so they evolve gradually and naturally. Why don't I like Google frameworks? Because they are academic-like: barely documented, lacking in visual design and appearance.

Google Guice is no exception. It's been around probably as long as Spring. Anyone looking for DI framework must give it a shot. But even today you hear Spring not Guice when people talk DI.

Having used JPA with Spring before, I firmly decided to stick with Guice this time. I have a simple back-end program: no web, no application server (PostgreSQL database and Hibernate as a JPA 1.0 provider).

I decided to take advantage of the latest JPA support in Guice: guice-persist. Guice 3.0 is required (upgrade from 2.0 if necessary) but Guice jar doesn't contain guice-persist: have both dependencies in your pom.xml when using Maven (updated):
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <version>3.0</version>
</dependency>
<dependency>
  <groupId>com.google.inject.extensions</groupId>
  <artifactId>guice-persist</artifactId>
  <version>3.0</version>
</dependency>
Without Maven simply follow your regular practices to add jars above to Java classpath.

Existing Guice configuration using module(s) need not change, but add the guice-persist module when creating the injector. Before:
Injector injector = Guice.createInjector(new MyAppModule());
After:
Injector injector = Guice.createInjector(new MyAppModule(), 
                          new JpaPersistModule("myapp-db"));
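myapp-db is a persistence unit defined in persistence.xml placed on the classpath (e.g. in a Maven project it lives in the src/main/resources/META-INF directory):

<persistence xmlns="http://java.sun.com/xml/ns/persistence" version="1.0">
    <persistence-unit name="myapp-db" transaction-type="RESOURCE_LOCAL">
        <provider>org.hibernate.ejb.HibernatePersistence</provider>
        <class>com.example.domain.MyEntity</class>
        <exclude-unlisted-classes>true</exclude-unlisted-classes>
        <properties>
            <!-- Hibernate and PostgreSQL specific properties go here -->
        </properties>
    </persistence-unit>
</persistence>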

This is a JPA 1.0 persistence unit - you should be able to use 2.0 without any problems. I disabled classpath scanning for entities using exclude-unlisted-classes, so every JPA entity must now be listed with a class element. The properties are specific to Hibernate and PostgreSQL.

guice-persist works via PersistService, which has to be started (initialized). I do it right after creating the injector, using the injector itself:

Injector injector = Guice.createInjector(new MyAppModule(), 
                          new JpaPersistModule("myapp-db"));
injector.getInstance(ApplicationInitializer.class);
and
public class ApplicationInitializer {
    @Inject
    ApplicationInitializer(PersistService service) {
        service.start();
        // At this point JPA is started and ready.

        // other application initializations if necessary
    }
}

The persistence unit defines transaction-type="RESOURCE_LOCAL" to have a JPA EntityManager created and destroyed for each database transaction. Inject the EntityManager and declare transactions in the DAO class with the following Guice annotations:
public class MyAppDAO {
    @Inject
    private EntityManager em;

    @Transactional
    public MyEntity find(long id) {
        return em.find(MyEntity.class, id);
    }

    @Transactional
    public void save(MyEntity entity) {
        em.persist(entity);
    }
}
We injected EntityManager with @com.google.inject.Inject and declared transactions with the @com.google.inject.persist.Transactional annotation. Now, each time the find or save method is called, a new transaction is started and committed in the JPA entity manager. For more details on transaction scope (unit of work) and exception handling see this.
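
To make the effect of @Transactional more concrete, here is a minimal plain-Java sketch of the begin/commit/rollback wrapping that such an annotation implies. It is not guice-persist code. The Tx interface, RecordingTx and inTransaction are hypothetical names invented for this illustration, and rolling back on RuntimeException is my assumption about the default behaviour.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Illustration only: roughly what a transactional method interceptor does. */
final class TransactionalSketch {

    /** Hypothetical minimal transaction, standing in for a JPA transaction. */
    interface Tx {
        void begin();
        void commit();
        void rollback();
    }

    /** Records calls so the order of events can be inspected. */
    static final class RecordingTx implements Tx {
        final List<String> events = new ArrayList<>();
        public void begin() { events.add("begin"); }
        public void commit() { events.add("commit"); }
        public void rollback() { events.add("rollback"); }
    }

    /** Runs the work between begin and commit; rolls back if it throws. */
    static <T> T inTransaction(Tx tx, Supplier<T> work) {
        tx.begin();
        try {
            T result = work.get();
            tx.commit();
            return result;
        } catch (RuntimeException e) {
            // assumption: unchecked exceptions undo the transaction
            tx.rollback();
            throw e;
        }
    }

    public static void main(String[] args) {
        RecordingTx tx = new RecordingTx();
        String result = inTransaction(tx, () -> "found");
        System.out.println(result + " " + tx.events); // found [begin, commit]
    }
}
```

With guice-persist you never write this wrapper yourself; the annotation asks the framework to do the equivalent around every annotated method call.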

Wrapping it up: there is a bare minimum of artifacts and configuration needed on top of the standard JPA persistence.xml: a new Guice persistence module (provided by guice-persist), persistence service initialization, standard Guice injection of EntityManager, and the new @Transactional annotation. Of course, your mileage may vary depending on your transactional (unit of work) needs, but it's hard to imagine less configuration and code when implementing data access with JPA and Hibernate.

For completeness I add a Guice module here (it needs no special configuration for guice-persist though, so I have DAO configuration here only):
public class MyAppModule extends AbstractModule {
 
    @Override
    protected void configure() {
        ...
        bind(ISomeDao.class).to(MyDao.class);
    }
}
References:
Using JPA with Guice Persist
Hibernate with JPA Annotations and Guice

Tuesday, February 22, 2011

Efficient Keyword Search with Relation Index Entities and Objectify for Google Datastore

Free text search with keywords on Google App Engine datastore made simple - in fact, simple enough to fit into a single blog entry.

I will use GAE/Java with Objectify for the datastore API (also see my newer post with a Python implementation). Assume we maintain a document library where each document has several textual attributes: name, title, subtitle, authors, publisher, reference number (similar to ISBN), tags, abstract, etc. While each attribute is semantically different, for a searcher they all carry some value (or relevance). Thus, a user may search for any of them with one or more keywords. For simplification, I consider only AND searches.

First, let’s model our entities (remember, we use Objectify, which in turn uses standard JPA annotations wherever possible):

@Entity(name = "Document")
public class Document {

  @Id
  private Long id;

  private String title;
  private List<String> authors = new ArrayList<String>();
  private String publisher;
  private List<String> tags = new ArrayList<String>();
  // more attributes as necessary...

  public Document() {
    super();
  }

  // standard getters/setters follow...
}


One thing to emphasize is the use of list properties such as authors and tags. Datastore treats them as multi-valued attributes, so a condition like authors == ‘John Doe’ returns all documents that have John Doe as one of the authors. This list property feature is critical in the next (and last) entity we define:

@Entity(name = "DocumentKeywords")
public class DocumentKeywords {

  @Id Long id;
  @Parent Key<Document> document;
  List<String> keywords = new ArrayList<String>();

  private DocumentKeywords() {
    super();
  }

  public DocumentKeywords(Key<Document> parent) {
    this(parent, Collections.<String>emptyList());
  }

  public DocumentKeywords(Key<Document> parent, Collection<String> keywords) {
    super();

    this.document = parent;
    this.keywords.addAll(keywords);
  }

  // add single keyword
  public boolean add(String keyword) {
    return keywords.add(keyword);
  }

  // add collection of keywords
  public boolean add(Collection<String> keywords) {
    return this.keywords.addAll(keywords);
  }
}


There are several things worth noting about DocumentKeywords.

First, it’s a child entity to Document (see @Parent annotation in Objectify). Parent Document and child DocumentKeywords make an entity group in datastore. This is important for data integrity – entity group rows can participate in transactions in datastore. Data integrity is critical in this case (you'll see shortly). Indeed, we'll duplicate attribute values between Document and DocumentKeywords. For each Document entity we create corresponding child DocumentKeywords to consolidate all document attributes into property keywords.

Secondly, keywords is a list property. A list property is limited to 5000 entries, which is often sufficient. If it's not, we could add more DocumentKeywords child rows for the same Document parent (not implemented here).
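
As a rough illustration of that idea, the plain-Java sketch below splits a long keyword list into groups that each fit into one child entity. KeywordChunker and its split method are hypothetical helpers, not part of Objectify or the code in this post, and the 5000 figure is simply the limit quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: splits keywords into groups small enough for one list property. */
final class KeywordChunker {

    // Limit quoted in the text; treated here as an assumption about the datastore.
    static final int MAX_KEYWORDS_PER_ENTITY = 5000;

    /** Returns consecutive sublists of at most maxPerChunk keywords each. */
    static List<List<String>> split(List<String> keywords, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<String>> chunks = new ArrayList<List<String>>();
        for (int from = 0; from < keywords.size(); from += maxPerChunk) {
            int to = Math.min(from + maxPerChunk, keywords.size());
            // copy the view so each chunk is independent of the source list
            chunks.add(new ArrayList<String>(keywords.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList("a", "b", "c", "d", "e");
        // a tiny limit makes the grouping visible: [[a, b], [c, d], [e]]
        System.out.println(split(keywords, 2));
    }
}
```

Each chunk would become its own DocumentKeywords child under the same parent. One caveat: an AND query is matched against a single child entity, so keywords that land in different chunks would not be found together unless the search logic is extended as well.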

Finally, why is the DocumentKeywords entity defined at all? Why is its keywords attribute not part of the Document entity? The answer is in this Google IO presentation (Spoiler: keeping keywords as a list property in Document would produce serialization overhead on the Document entity, at least doubling it since it's an exact copy of the rest of the Document attributes. Moving keywords to a separate entity is called a Relation Index Entity and it gives us the best of both worlds: fully indexed attributes (via list property) and no serialization overhead for documents.)

We add new Document and index document attributes in child DocumentKeywords in one transaction:

// we use DI to initialize factory (application scoped)
private final ObjectifyFactory factory;

private Objectify tran = null;

public Document addDocument(Document document) {

  try {
    Objectify ofy = beginTransaction();

    Key<Document> key = ofy.put(document);

    // create Relation Index Entity
    DocumentKeywords rie = new DocumentKeywords(key);
    rie.add(document.getTitle());
    rie.add(document.getAuthors());
    rie.add(document.getPublisher());
    rie.add(document.getTags());
    ofy.put(rie);

    commit();

    Document savedDocument = beginQuery().find(key);
    return savedDocument;

  } finally {
    rollbackIfActive();
  }
}

I left the transactional methods used above (beginTransaction, commit, rollbackIfActive, beginQuery) as an exercise. Just note an important datastore gotcha: rows added inside a transaction are not visible within that transaction while it's still active. If you add a row and then try to query it before committing, you won't find it. Commit the transaction first and then read its data.

Now we are ready to make actual keyword searches:

public Collection<Document> findByKeywords(Collection<String> keywords) {

  Objectify ofy = beginQuery();

  Query<DocumentKeywords> query = ofy.query(DocumentKeywords.class);
  for (String keyword : keywords) {
    query = query.filter("keywords", keyword);
  }

  Set<Key<Document>> keys = query.<Document>fetchParentKeys();

  Collection<Document> documents = ofy.get(keys).values();

  return documents;
}


You can see that keyword search is a 3-step process. In the first step we iteratively build the search condition (AND only). In the second step we query DocumentKeywords to retrieve keys only, so there is no overhead of serializing bulky keywords. Lastly, we convert the retrieved DocumentKeywords keys into parent keys (documents) and use a datastore batch get to return them. Objectify made all steps quite transparent and efficient.
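
For readers without a datastore at hand, the same three steps can be mimicked with plain Java collections. This is only an in-memory stand-in under my own assumptions: KeywordSearchSketch and its methods are hypothetical, documents are reduced to titles, and nothing here calls Objectify or the datastore.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** In-memory stand-in for the datastore; all names here are hypothetical. */
final class KeywordSearchSketch {

    // parent document id -> keywords (plays the role of DocumentKeywords)
    private final Map<Long, List<String>> index = new LinkedHashMap<Long, List<String>>();
    // document id -> title (plays the role of Document)
    private final Map<Long, String> documents = new LinkedHashMap<Long, String>();

    void add(long id, String title, String... keywords) {
        documents.put(id, title);
        index.put(id, Arrays.asList(keywords));
    }

    List<String> findByKeywords(Collection<String> keywords) {
        // Steps 1 and 2: apply every keyword filter (AND) and keep parent keys only.
        // A multi-valued property matches when any of its values equals the keyword.
        Set<Long> parentKeys = new LinkedHashSet<Long>();
        for (Map.Entry<Long, List<String>> entry : index.entrySet()) {
            if (entry.getValue().containsAll(keywords)) {
                parentKeys.add(entry.getKey());
            }
        }
        // Step 3: "batch get" the parent documents by key.
        List<String> result = new ArrayList<String>();
        for (Long key : parentKeys) {
            result.add(documents.get(key));
        }
        return result;
    }

    public static void main(String[] args) {
        KeywordSearchSketch store = new KeywordSearchSketch();
        store.add(1L, "Guice in Practice", "guice", "jpa", "john doe");
        store.add(2L, "Datastore Basics", "datastore", "jpa");
        System.out.println(store.findByKeywords(Arrays.asList("jpa")));          // both titles
        System.out.println(store.findByKeywords(Arrays.asList("jpa", "guice"))); // first title only
    }
}
```

The sketch scans every index entry, which the real datastore avoids by using its indexes. What it does preserve is the shape of the query: filter on the small keyword entities, then load the full documents by key.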

That is all there is to it. Let me make a few comments about this example. It is purposely contrived, but it should map to real cases with no changes in principle. Documents could be friends in a social network, products from an online retail catalog, or blog entries on a blogging web site. I intentionally left document content out of the list of attributes. The current datastore limit doesn’t allow me to build an elegant and concise solution beyond 5000 keywords per document, so including document content is risky. Even though a simple enhancement overcomes this limitation, I didn’t want to overload the code above.

Extending this to free text search would mean supporting such features as word normalization and stemming, case-insensitivity, logical operations, keyword proximity (e.g. same attribute or related attributes), and going beyond the datastore's 5000-entry list property limit.
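
As a small taste of the first two items, here is a hedged sketch of keyword normalization: attribute values are lower-cased, split into words, and de-duplicated before they would be stored or searched. KeywordNormalizer is a hypothetical helper of my own, and it deliberately does no stemming.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: turns raw attribute values into lower-case single-word keywords. */
final class KeywordNormalizer {

    static Set<String> normalize(String... values) {
        // LinkedHashSet drops duplicates while keeping first-seen order
        Set<String> keywords = new LinkedHashSet<String>();
        for (String value : values) {
            if (value == null) {
                continue; // skip attributes that are not set
            }
            // split on anything that is neither a letter nor a digit
            for (String token : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    keywords.add(token);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) {
        // prints [efficient, keyword, search, john, doe]
        System.out.println(normalize("Efficient Keyword Search", "John Doe", "keyword"));
    }
}
```

The same normalization would have to be applied to the user's search terms, otherwise the stored keywords and the query would not line up.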

References:
1. Building Scalable, Complex Apps on App Engine
2. Datastore List Property
3. Stemming
4. Objectify
5. RIE with Python