Because best way to understand is explaining.

Friday, October 21, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 1)

Part 1: Embedded Relation Index Entity


I continue series of posts about keyword-based searches on Google App Engine (GAE) using Relation Index Entity (see RIE with Java and REI with Python posts in this order). Having implemented efficient search on GAE let's switch the focus to usability. When user searches for a word which is not one of the indexed keywords our search will yield no results. To help user be more efficient searching for documents we can introduce auto-complete pattern looking like this in a browser:


With usability we reduce number of RIE searches that yield no results (since user can still enter arbitrary words ignoring autocomplete) which helps GAE bill. It is a win-win if we do it right.

First, let's build a foundation: searchable list of all keywords. Existing RIE is of limited use as it is designed to search for documents by keywords - not for keywords themsevles. Thus we need new entity to store unique keywords:

class Keyword(db.Model):
    keyword = db.StringProperty()

Let's plug it in where we build document RIE:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)
            
for keyword in keywords:
    keyword_entity = Keyword(key_name=keyword.lower(), keyword=keyword.lower())
    keyword_entity.put()

Compare this to the original post to notice additional return value for keywords in add_document:

def add_document(title, authors, publisher, tags):
     
    # the same code
     
    return (doc, keywords)

Code that stores keywords in Keyword entity is not optimized as it saves existing keywords over and over again: you may want to improve it appropriately by reading it first or using memcache caching system.

The Keyword entity is very simple but worth noting that it has a key name (not id) equal to normalized keyword. The only string property it has is a normalized version of keyword - the one that is used in all searches. The normalization we use is just a lower-casing while more robust version would feature unicode normalization (e.g. removing accents), replacement of standard abbreviations (such as St.), stripping off characters (such as ' or -) and even synonyms.

Unless user enters normalized version of keyword autocomplete will display nothing. We can choose to normalize prefix string before querying as a simplest approach but I chose different solution (partially for demonstration purpose but also because it takes care of arbitrary normalization algorithms). The solution is to use Keyword as embedded Relation Index Entity: its key name having data and its field being index (in standard RIE data is in parent entity and index is a child entity, remember Document and DocumentKeywords?). This change should go a long way when we introduce more elaborate normalization algorithms as number of words that normalize down to the same keyword will grow. So Keyword entity gets its own StringListProperty to store non-normalized words corresponding to the same keyword (plus normalized version of course):

class Keyword(db.Model):
    words = db.StringListProperty()

with populating it like this:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)
            
for keyword in keywords:
                normalized = normalize_keyword(keyword)
                keyword_entity = Keyword.get_by_key_name(normalized)   
 
                if keyword_entity is None: # new keyword entity 
                    keyword_entity = Keyword(key_name=normalized, 
                                             words=list(Set([normalized, keyword])))
                else:
                    if (not keyword in keyword_entity.words):
                        keyword_entity.words.append(keyword)
                    else:
                        keyword_entity = None # no update necessary
                          
                if keyword_entity is not None: # save new or updated keyword
                    keyword_entity.put()

Our normalization is still the same but is factored out as it is expected to get more complex with time:

def normalize_keyword(keyword):
    return keyword.lower() 

So how would we search keywords given few characters of user input? It would look something like this:

query = Query(Keyword, keys_only=True)        
query.filter('words >=', term)
query.filter('words <=', unicode(term) + u"\ufffd")
        
keyword_keys = query.fetch(20, 0)
result = map(lambda x: x.name(), keyword_keys) 
At this point we have keywords ready to be served with autocomplete plugin on a client. In part 2 we will take care of the browser with YUI3 AutoComplete Plugin and AJAX for both XHR and JSONP URL style RPC services sprinkled with memcache.

No comments: