novyden: October 2011

Sunday, October 23, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 2)

Part 2: Autocomplete with YUI3 and GAE RPC Handlers

This is the second and final part of a two-part series on implementing AutoComplete with GAE. Recall that in part 1 we built the foundation for a keyword autocomplete lookup service for our document search application. Both the service itself and its HTML/JavaScript client materialize below.

Let's switch for a moment to the JavaScript side, where I intend to use YUI3 AutoComplete. It supports a variety of sources to query for available auto-complete choices, including XHR (XMLHttpRequest) style and JSONP style URL sources. While an XHR URL suffices when working within the bounds of the same application, the simplicity of both the YUI widget and the GAE RPC service support lets us do both with almost no extra work. (A JSONP service allows access from third-party web pages, which should not be taken lightly because of security concerns.)

The choice of the YUI3 widget over other libraries, such as jQuery with its Autocomplete plugin, is not important, as one can swap plugins with a few lines of JavaScript. The YUI3 library offers a rich set of built-in options and a wide variety of other compatible widgets, utilities and infrastructure, and its API now resembles jQuery (I believe Yahoo stole, not copied, in Picasso's sense).

A great article by Paul Peavyhouse contains the building blocks for RPC handlers in GAE. We begin with the RPCHandler class:

class RPCHandler(webapp.RequestHandler):
    """ Allows the functions defined in the RPCMethods class to be RPCed."""

    def __init__(self):
        webapp.RequestHandler.__init__(self)
        self.methods = RPCMethods()

    def get(self):
        func = None

        action = self.request.get('action')
        if action:
            if action[0] == '_':
                self.error(403) # access denied
                return
            else:
                func = getattr(self.methods, action, None)

        if not func:
            self.error(404) # file not found
            return        

        args = ()
        while True:
            key = 'arg%d' % len(args)
            val = self.request.get(key)
            if val:
                args += (val,)
            else:
                break
            
        # Checking if result is cached
        cache_key = action + ';' + ';'.join(args)
        result = memcache.get(cache_key) #@UndefinedVariable
        
        # Query if it's not
        if result is None:
            result = func(*args)
            memcache.add(cache_key, result, 900) #@UndefinedVariable
            
        return_data = self.prepare_result(result)
        self.response.out.write(return_data)
        self.response.headers['Content-Type'] = "application/json"

Actually RPCHandler takes care of roughly 90% of the job:

it retrieves action from request and matches it to appropriate RPC method from RPCMethods class via reflection (lines: 4-6 and 9-21)
it extracts service parameters from the request (parameter names matching argN) to pass to RPC method (lines: 23-30)
it forms a key to cache this call and checks if the result is already available from memcache (lines: 32-34)
it calls RPC method and saves results in cache (lines: 36-39)
it formats results and sends them back to a client (lines: 41-43)
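
The steps above can be tried outside App Engine. The following is a minimal, self-contained approximation, not the handler itself: a plain dict stands in for the request parameters, another dict stands in for memcache (without expiry), and DemoMethods, collect_args and dispatch are hypothetical names introduced only for illustration.

```python
class DemoMethods(object):
    """Hypothetical stand-in for RPCMethods."""

    def echo(self, *args):
        return list(args)

    def _secret(self, *args):
        return "should never be reachable"


def collect_args(params):
    """Gather arg0, arg1, ... until the first missing or empty value."""
    args = ()
    while True:
        val = params.get('arg%d' % len(args))
        if not val:
            break
        args += (val,)
    return args


def dispatch(methods, params, cache):
    """Return (status, result) following the same decisions as get()."""
    action = params.get('action')
    if action and action[0] == '_':
        return 403, None            # private names are never exposed
    func = getattr(methods, action, None) if action else None
    if not func:
        return 404, None            # unknown or missing action

    args = collect_args(params)
    cache_key = action + ';' + ';'.join(args)
    result = cache.get(cache_key)   # a dict used in place of memcache
    if result is None:
        result = func(*args)
        cache[cache_key] = result
    return 200, result


cache = {}
methods = DemoMethods()
ok = dispatch(methods, {'action': 'echo', 'arg0': 'py', 'arg1': '10'}, cache)
denied = dispatch(methods, {'action': '_secret'}, cache)
missing = dispatch(methods, {'action': 'nope'}, cache)
```

Note that the argument loop stops at the first empty value, so an empty arg0 hides any later arguments.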

RPCHandler is an abstract class: concrete handlers extend it using the template method pattern. A single abstract method, prepare_result, lets us have both XHR and JSONP style handlers:

from django.utils import simplejson  # bundled with the App Engine Python runtime

class JSONPHandler(RPCHandler):
    
    def prepare_result(self, result):
        callback_name = self.request.get('callback')
        json_data = simplejson.dumps(result)
        return_data = callback_name + '(' + json_data + ');'
        return return_data
    
class XHRHandler(RPCHandler):
    
    def prepare_result(self, result):
        json_data = simplejson.dumps(result)
        return json_data

While XHRHandler formats data as JSON, JSONPHandler wraps the generated JSON in a call to the callback function, as a JSONP client expects. The simplejson encoder, provided by Django and imported from django.utils, is part of the App Engine environment.
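
To make the difference between the two reply formats concrete, here is a small sketch that uses the standard library json module in place of simplejson. The callback-name check is an addition that is not in the handlers above: it is one possible way to avoid echoing arbitrary request text into a script response. The names format_xhr, format_jsonp and CALLBACK_RE are illustrative, and the callback name in the example is made up.

```python
import json
import re

# Assumption: callbacks are plain dotted JavaScript identifiers.
CALLBACK_RE = re.compile(r'^[A-Za-z_$][\w$]*(?:\.[A-Za-z_$][\w$]*)*\Z')


def format_xhr(result):
    """XHR style: the body is just the JSON document."""
    return json.dumps(result)


def format_jsonp(result, callback_name):
    """JSONP style: the same JSON wrapped in a call to the client's callback."""
    if not CALLBACK_RE.match(callback_name):
        raise ValueError('unsafe callback name: %r' % callback_name)
    return callback_name + '(' + json.dumps(result) + ');'


keywords = ['python', 'pypy']
xhr_body = format_xhr(keywords)
jsonp_body = format_jsonp(keywords, 'YUI.Env.JSONP.cb1')
```

A JSONP reply is a script rather than a JSON document, so serving it with a JavaScript content type instead of application/json may also be worth considering.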

With the RPC plumbing done, the RPCMethods class does the actual work: its method for the keyword autocomplete action is ac_keywords (later you can offer more services by adding methods to RPCMethods):

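from google.appengine.ext.db import Query

class RPCMethods:

    def ac_keywords(self, *args):
        prefix = args[0]
        limit = int(args[1])

        # Keys-only range query: any word in [prefix, prefix + U+FFFD] matches
        query = Query(Keyword, keys_only=True)
        query.filter('words >=', prefix)
        query.filter('words <=', unicode(prefix) + u"\ufffd")

        keyword_keys = query.fetch(limit, 0)
        # The key name is the normalized keyword itself
        result = [key.name() for key in keyword_keys]
        return result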

The method ac_keywords executes a search that matches all keywords starting with prefix and returns the normalized version of each matching keyword, taken from the retrieved key. In the first part we called this approach embedded RIE for exactly this reason: the key is retrieved as data by searching over a string list property.

Now that everything is ready on the GAE side (well, almost: I left the last piece of code for the very end), we can take care of the HTML inside the browser. I start by defining a form containing an input field to enter keywords:

<form method="post" action="#" id="search_form">
<p>
<input id="search_field" class="search" type="text" name="search_field" 
               value="Enter keywords..." 
               onfocus="if(!this._haschanged){this.value=''};this._haschanged=true;"/>
        <input name="search" type="image" src="images/search.png" 
               alt="Search" title="Search" />
    </p>
</form>

A new empty input box contains the phrase Enter keywords... which disappears as soon as the user focuses on the field:

With auto-complete enabled it will look like this:

Adding the YUI3 AutoComplete plugin takes just a few lines of JavaScript, which also include extra customization to control highlighting, filtering, and the delimiter for matching words (these are far from all the options available to tune this plugin for one's needs):

...

I used queryDelimiter to activate autocomplete for each word the user enters; feel free to play with these and other attributes the plugin offers. The commented-out source line defines the source URL for the JSONP style service, while the active source line is for the XHR URL.

Finally, the last piece of server-side Python code enables both URLs using the webapp framework for Python (in main.py):

application = webapp.WSGIApplication(
            [( '/',                 MainHandler),
             ...
             ( '/rpc.jsonp',        JSONPHandler),
             ( '/rpc.xhr',          XHRHandler)
            ])
    
run_wsgi_app(application)
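
Given these routes and the parameter names the handler reads (action, arg0, arg1, and callback for JSONP), a request for keyword suggestions could be assembled as below. This is only a sketch of the URL shape: build_rpc_url is a hypothetical helper and the callback name is made up.

```python
from urllib.parse import urlencode  # Python 3; on Python 2 use urllib.urlencode


def build_rpc_url(path, action, args, callback=None):
    """Build a query string with action, arg0..argN and an optional callback."""
    params = [('action', action)]
    params += [('arg%d' % i, str(arg)) for i, arg in enumerate(args)]
    if callback is not None:
        params.append(('callback', callback))
    return path + '?' + urlencode(params)


xhr_url = build_rpc_url('/rpc.xhr', 'ac_keywords', ['py', 10])
jsonp_url = build_rpc_url('/rpc.jsonp', 'ac_keywords', ['py', 10],
                          callback='handle')
```

Here arg0 carries the prefix typed so far and arg1 the maximum number of suggestions, matching the order in which ac_keywords reads its arguments.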

I want to finish by emphasizing how efficient this solution is. The YUI3 AutoComplete plugin automatically caches responses from JSONP and XHR URL sources, keyed on the query value, for the duration of the pageview. The Python RPC services implemented in GAE use memcache automatically and transparently to cache results for each action type and query value (for 900 seconds in the handler above). Finally, querying for matching keywords in the datastore uses keys-only queries, which are the least expensive. Given that an autocomplete feature on a busy site is likely to be heavily used, all of these contribute to both performance and savings on GAE.

Friday, October 21, 2011

AutoComplete Using Google App Engine and YUI in two parts (part 1)

Part 1: Embedded Relation Index Entity

I continue the series of posts about keyword-based searches on Google App Engine (GAE) using Relation Index Entity (see the RIE with Java and RIE with Python posts, in this order). Having implemented efficient search on GAE, let's switch the focus to usability. When a user searches for a word that is not one of the indexed keywords, our search yields no results. To help the user search for documents more efficiently, we can introduce the auto-complete pattern, which looks like this in a browser:

With better usability we reduce the number of RIE searches that yield no results (though the user can still enter arbitrary words, ignoring autocomplete), which helps the GAE bill. It is a win-win if we do it right.

First, let's build a foundation: a searchable list of all keywords. The existing RIE is of limited use, as it is designed to search for documents by keywords, not for keywords themselves. Thus we need a new entity to store unique keywords:

class Keyword(db.Model):
    keyword = db.StringProperty()

Let's plug it in where we build document RIE:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)
            
for keyword in keywords:
    keyword_entity = Keyword(key_name=keyword.lower(), keyword=keyword.lower())
    keyword_entity.put()

Compare this to the original post to notice additional return value for keywords in add_document:

def add_document(title, authors, publisher, tags):
     
    # the same code
     
    return (doc, keywords)

The code that stores keywords in the Keyword entity is not optimized, as it saves existing keywords over and over again: you may want to improve it by reading the entity first or by using the memcache caching system.
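
One way to read that suggestion: remember which keywords were already stored and skip the write for those. The sketch below is an illustration under loud assumptions rather than datastore code: a dict stands in for the datastore, a set stands in for memcache, and store_keywords is a hypothetical name.

```python
def store_keywords(keywords, datastore, seen):
    """Write each normalized keyword at most once; return the number of writes."""
    writes = 0
    for keyword in keywords:
        name = keyword.lower()
        if name in seen:             # cheap check, plays the role of memcache
            continue
        if name not in datastore:    # plays the role of a get by key name
            datastore[name] = {'keyword': name}
            writes += 1
        seen.add(name)
    return writes


datastore, seen = {}, set()
first = store_keywords(['Python', 'GAE', 'python'], datastore, seen)
second = store_keywords(['gae', 'YUI'], datastore, seen)
```

In this run the first call performs two writes and the second only one, because the repeated keywords are skipped.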

The Keyword entity is very simple, but it is worth noting that it has a key name (not an id) equal to the normalized keyword. The only string property it has is the normalized version of the keyword, the one used in all searches. The normalization we use is just lower-casing, while a more robust version would feature Unicode normalization (e.g. removing accents), replacement of standard abbreviations (such as St.), stripping of characters (such as ' or -) and even synonyms.

Unless the user enters the normalized version of a keyword, autocomplete will display nothing. The simplest approach would be to normalize the prefix string before querying, but I chose a different solution (partially for demonstration purposes, but also because it takes care of arbitrary normalization algorithms). The solution is to use Keyword as an embedded Relation Index Entity: its key name holds the data and its field is the index (in standard RIE the data is in the parent entity and the index is a child entity; remember Document and DocumentKeywords?). This change should go a long way when we introduce more elaborate normalization algorithms, as the number of words that normalize down to the same keyword will grow. So the Keyword entity gets its own StringListProperty to store the non-normalized words corresponding to the same keyword (plus the normalized version, of course):

class Keyword(db.Model):
    words = db.StringListProperty()

and we populate it like this:

doc, keywords = db.run_in_transaction(add_document, title, authors, publisher, tags)

for keyword in keywords:
    normalized = normalize_keyword(keyword)
    keyword_entity = Keyword.get_by_key_name(normalized)

    if keyword_entity is None:  # new keyword entity
        keyword_entity = Keyword(key_name=normalized,
                                 words=list(set([normalized, keyword])))
    elif keyword not in keyword_entity.words:
        keyword_entity.words.append(keyword)
    else:
        keyword_entity = None  # no update necessary

    if keyword_entity is not None:  # save new or updated keyword
        keyword_entity.put()
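
The effect of this loop is easier to see on plain data. In the sketch below a dict keyed by normalized keyword plays the role of the Keyword entities; add_keyword is a hypothetical helper and lower-casing is assumed as the normalization.

```python
def add_keyword(store, keyword):
    """Record keyword under its normalized key; return True if a save was needed."""
    normalized = keyword.lower()
    words = store.get(normalized)
    if words is None:
        # New entity: the normalized form plus the word as entered.
        store[normalized] = sorted(set([normalized, keyword]))
        return True
    if keyword not in words:
        words.append(keyword)
        return True
    return False


store = {}
saved = [add_keyword(store, w) for w in ['Python', 'python', 'PYTHON', 'Python']]
```

Only the first and third calls need a save here: the second and fourth find their spelling already listed under the key python.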

Our normalization is still the same but is factored out as it is expected to get more complex with time:

def normalize_keyword(keyword):
    return keyword.lower()
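
As an illustration of where this function might go, the sketch below adds accent removal and stripping of apostrophes and hyphens using the standard unicodedata module. It is one possible variant, not the version used in this series: normalize_keyword_robust is a hypothetical name, the code assumes Python 3 strings (on Python 2 the input would have to be unicode), and abbreviations and synonyms are left out.

```python
import unicodedata


def normalize_keyword_robust(keyword):
    """Lower-case, strip accents, and drop apostrophes and hyphens."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', keyword)
    without_accents = ''.join(
        ch for ch in decomposed if not unicodedata.combining(ch))
    for ch in ("'", "-"):
        without_accents = without_accents.replace(ch, '')
    return without_accents.lower()
```

With this variant, words such as Café and cafe would end up under the same key name.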

So how would we search keywords given a few characters of user input? It would look something like this:

query = Query(Keyword, keys_only=True)        
query.filter('words >=', term)
query.filter('words <=', unicode(term) + u"\ufffd")
        
keyword_keys = query.fetch(20, 0)
result = map(lambda x: x.name(), keyword_keys)
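
The pair of inequality filters is a prefix range scan: every string that starts with term sorts between term and term followed by U+FFFD, a code point near the top of the Basic Multilingual Plane. The following sketch imitates that on in-memory data. The dict of word lists and search_keywords are illustrative assumptions, results are sorted by key name only to keep the example deterministic, and the real query relies on the datastore index rather than a loop.

```python
def search_keywords(store, term, limit=20):
    """Return key names whose word list has a value in [term, term + U+FFFD]."""
    upper = term + u'\ufffd'
    matches = []
    for key_name in sorted(store):
        # One matching word is enough for the whole entity to match.
        if any(term <= word <= upper for word in store[key_name]):
            matches.append(key_name)
    return matches[:limit]


store = {
    'python': ['python', 'Python'],
    'pypy': ['pypy', 'PyPy'],
    'java': ['java', 'Java'],
}
lower_hits = search_keywords(store, 'py')
mixed_hits = search_keywords(store, 'Py')
no_hits = search_keywords(store, 'ruby')
```

Both py and Py lead to the same normalized key names, which is the point of keeping the non-normalized spellings in the words list.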

At this point we have keywords ready to be served to the autocomplete plugin on a client. In part 2 we will take care of the browser with the YUI3 AutoComplete plugin and AJAX for both XHR and JSONP URL style RPC services, sprinkled with memcache.
