Because best way to understand is explaining.

Friday, September 23, 2011

One more time: Relation Index Entities with Python for Google Datastore

Hey, why not? Relation Index Entities implemented in Python should nicely complement my RIE with Java/Objectify post. Because I follow the same example and concepts, I will not go over them twice but rather concentrate on the specifics of the Python example (again, please see the RIE with Java/Objectify post for general concepts).

I start with defining datastore entities in Python:
from google.appengine.ext import db


class Document(db.Model):
    title = db.StringProperty()
    abstract = db.StringProperty()
    authors = db.StringListProperty()
    publisher = db.StringProperty()
    tags = db.StringListProperty()


class DocumentKeywords(db.Model):
    keywords = db.StringListProperty()

As before, the list property keywords of DocumentKeywords is the critical element of the RIE design. The brevity of the Python implementation is striking when comparing the entity definitions with the Java/Objectify version. For example, nothing indicates that these two entities comprise an entity group. The entity classes as defined may or may not belong to the same entity group: the parent-child relationship is optional and is established when concrete instances are constructed.

Let’s see how we could add a new Document with Python:

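def add_document(title, authors, publisher, tags):

    doc = Document()
    doc.title = title
    doc.authors = authors
    doc.publisher = publisher
    doc.tags = tags

    # Save the document first so it has a complete key to serve as parent.
    doc.put()

    # Collect every searchable value into a single flat list.
    keywords = []
    keywords.append(doc.title)
    keywords.extend(doc.authors)
    keywords.append(doc.publisher)
    keywords.extend(doc.tags)

    # parent=doc places the index entity in the document's entity group.
    doc_keywords = DocumentKeywords(parent=doc, keywords=keywords)
    doc_keywords.put()

    return doc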


Constructing DocumentKeywords with the parent argument makes it part of the document's entity group, so we can call add_document within a transaction:

doc = db.run_in_transaction(add_document, title, authors, publisher, tags)
  

Of course, I oversimplified this example: real keywords would be treated more carefully, with consistent case, stop word filtering, robust parsing, etc. But that is not the goal of this exercise, so I am leaving it all out.
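For readers who want a starting point, here is a minimal sketch of the kind of keyword clean-up mentioned above. It is not part of the original example: the function name normalize_keywords and the tiny STOP_WORDS list are hypothetical, and the splitting rule assumes plain ASCII text.

```python
import re

# Hypothetical stop-word list; a real application would use a fuller one.
STOP_WORDS = frozenset(['a', 'an', 'and', 'of', 'the', 'for', 'with',
                        'in', 'on', 'to'])


def normalize_keywords(values):
    """Split values into lowercase words, dropping stop words and duplicates."""
    seen = set()
    keywords = []
    for value in values:
        # Treat any run of characters other than a-z and 0-9 as a separator
        # (ASCII-only assumption; accented letters would be split apart).
        for word in re.split(r'[^a-z0-9]+', value.lower()):
            if word and word not in STOP_WORDS and word not in seen:
                seen.add(word)
                keywords.append(word)
    return keywords
```

Under these assumptions, add_document could store normalize_keywords(keywords) instead of the raw list, and the same function would have to be applied to the search terms so that both sides agree on case and word boundaries.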

Finally, the keyword search method:

def find_by_keywords(keywords):

    query = reduce(lambda q, keyword: q.filter('keywords =', keyword),
                   keywords, db.Query(DocumentKeywords, keys_only=True))

    keywords_keys = query.fetch(100, 0)

    doc_keys = map(lambda x: x.parent(), keywords_keys)
    docs = db.get(doc_keys)

    return docs

Again, as in the Java/Objectify version, it is a 3-step process. The first step (lines 3-4) builds a query by reducing the list of keywords onto a query object, adding one equality filter per keyword (AND condition only). The query is keys-only, as we are never interested in anything else from DocumentKeywords. The second step (line 6) uses the query to retrieve keys only (keys of DocumentKeywords, that is). And lastly (lines 8-9), we map the retrieved keys to the parent Document keys and retrieve the desired documents with a batch get.
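The reduce step can be tried outside App Engine with a stand-in for the query object. The sketch below is only an illustration: FakeQuery and build_query are hypothetical names, and FakeQuery merely records the filters it receives, assuming (as the code above does) that filter returns the query so calls can be chained. The starting query is passed as the third argument of reduce, which makes it the initial value of the fold.

```python
from functools import reduce  # a builtin on Python 2; this import works on 2.6+ and 3


class FakeQuery(object):
    """Hypothetical stand-in for db.Query that only records its filters."""

    def __init__(self):
        self.filters = []

    def filter(self, property_operator, value):
        self.filters.append((property_operator, value))
        return self  # returning the query is what allows chaining


def build_query(keywords, query):
    # Fold each keyword into the query; `query` is reduce's initial value,
    # so an empty keyword list simply returns the query unfiltered.
    return reduce(lambda q, keyword: q.filter('keywords =', keyword),
                  keywords, query)
```

Calling build_query(['python', 'datastore'], FakeQuery()) leaves two recorded filters, one per keyword, which is the AND condition described above.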

I find this Python solution more elegant and concise than the Java/Objectify code. And it is clear that the Java code is saved by Objectify, especially the query part, where all datastore API features are clearly defined and easily available.

Some questions that arise when discussing the Relation Index Entity (RIE) pattern concern its feasibility in real applications. Performance issues associated with creating and maintaining such an index are common ones, as is its applicability in modeling data (for example, see this post by Jeff Schnitzer of Objectify fame).

In my view, the effectiveness of RIE depends on how the index is used (just like any index in general). If reads outnumber updates by a factor of 100 or more, it could be feasible. The higher the ratio, the better. Thus for mostly static data, such as a document library or other reference material, near-free-text search implemented with RIE is definitely an option to consider. Of course, the ultimate test should be a GAE bill that reflects the resources spent on both queries and updates, and how the alternatives compare against that bill.

The other point is that RIE should not play any part in the data model. Effectively it is an extension of the indexes generated by the datastore and should be treated as such.

And lastly, there is clearly a high initial cost of creating the RIE index for datastore artifacts (like documents). I recommend using task queues, mapreduce and/or blobstore, and doing as much pre-processing locally as possible before loading data into GAE. For example, document keywords can be extracted and processed on your own server before uploading them to the datastore. That way you can end up with a simple CSV file to process on GAE.
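As an illustration of that last step, here is a minimal sketch of local pre-processing that produces such a CSV file. The row layout (title first, then one keyword per column) and the function names are assumptions made for this example, and it is written for Python 3 on your own machine rather than for the GAE runtime.

```python
import csv
import io


def write_keywords_csv(documents, out):
    """Write one CSV row per document: its title followed by its keywords."""
    writer = csv.writer(out, lineterminator='\n')
    for title, keywords in documents:
        writer.writerow([title] + list(keywords))


def read_keywords_csv(source):
    """Yield (title, keywords) pairs back from the CSV text."""
    for row in csv.reader(source):
        if row:  # skip blank lines
            yield row[0], row[1:]


# Round trip through an in-memory buffer instead of a real file.
buf = io.StringIO()
write_keywords_csv([('Dune', ['dune', 'herbert'])], buf)
buf.seek(0)
pairs = list(read_keywords_csv(buf))
```

On the GAE side, each pair read back this way would supply the keywords list for one DocumentKeywords entity.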

5 comments:

sys.out said...

Nice post! I've submitted it on GAE Cupboard.

jelly said...

what is the second parameter of fetch(100, 0)?

Gregory Kanevsky said...

jelly, fetch(100, 0) retrieves up to 100 entities, skipping the first 0 results (the second parameter is the offset).

powerpopp said...

What is the type of 'keywords' sent into the find routine? I tried sending in a python list and received an error saying the second arg of reduce must be iterable.

powerpopp said...

I couldn't get the reduce to work so I unfolded the query build up like so:

query = db.Query(BrighterOfferKeywords, keys_only=True)

for tag in keywords:
    query = query.filter('keywords =', tag)