Friday, September 23, 2011

One more time: Relation Index Entities with Python for Google Datastore

Hey, why not? Relation Index Entities implemented in Python should nicely complement my RIE with Java/Objectify post. Because I follow the same example and concepts I will not go over them twice but rather concentrate on specifics of Python example (again, please see RIE with Java/Objectify post for general concepts).

I start with defining datastore entities in Python:
class Document(db.Model):
    title = db.StringProperty()
    abstract = db.StringProperty()
    authors = db.StringListProperty()
    publisher = db.StringProperty()
    tags = db.StringListProperty()
    

class DocumentKeywords(db.Model):
    keywords = db.StringListProperty()

As before list property keywords of DocumentKeywords is a critical element of RIE design. The brevity of Python implementation is striking when comparing entity definitions with Java/Objectify version. For example, there is nothing that indicates these 2 entities comprise an entity group. The entity classes as defined may or may not belong to the same entity group: during construction of concrete instances we can establish optional parent-child relationship.

Let’s see how we could add new Document with Python:

def add_document(title, authors, publisher, tags):
    
    doc = Document()
    doc.title = title
    doc.authors = authors
    doc.publisher = publisher
    doc.tags = tags
    
    doc.put()
    
    keywords = []
    keywords.append(doc.title)
    keywords.extend(doc.authors)
    keywords.append(doc.publisher)
    keywords.extend(doc.tags)
    
    doc_keywords = DocumentKeywords(parent=doc, keywords=keywords)
    doc_keywords.put() 
    
    return doc


By constructing DocumentKeyword with parent argument it becomes part of the document entity group so we can call add_document within a transaction:

doc = db.run_in_transaction(add_document, title, authors, publisher, tags)
  

Of course I oversimplified this example as real keywords would have been treated more carefully with consistent case, stop word filtering, robust parsing, etc. But this is not the goal of this exercise so I am leaving it all out.

Finally, the keyword search method:

def find_by_keywords(keywords):
        
        query = reduce(lambda x, y : x.filter('keywords =', y), 
                       keywords.insert(0, db.Query(DocumentKeywords, keys_only=True)))
        
        keywords_keys = query.fetch(100, 0)
        
        doc_keys = map(lambda x:x.parent(), keywords_keys)
        docs = db.get(doc_keys)
        
        return docs

Again as in Java/Objectify version it is a 3-step process. First step (line 3-4) builds a query reducing the list of keywords on a query object (AND condition only). The query built is for keys only as we are never interested in anything else from DocumentKeywords. Second step (line 6) uses the query to retrieve keys only (keys of DocumentKeywords that is). And lastly (lines 8-9) we map retrieved keys to parent Document keys and retrieve desired documents with batch get.

I find this Python solution more elegant and concise than Java/Objectify code. And it is clear that Java code is saved by Objectify especially the query part where all datastore API features are clearly defined and easily available.

Some questions arise when discussing Relation Index Entity pattern (RIE) that are about its feasibility in real applications. Performance issues associated with creating and maintaining such index are common ones as well as its applicability in modeling data (for example see this post by Jeff Schnitzer of Objectify fame).

In my view the effectiveness of RIE depends on how this index is used (just like any index in general). If its usage is in order of 100s or more reads than updates then it could be feasible. The higher the ratio the better. Thus for mostly static data such as document library or other references near free text search implemented with RIE is definitely option to consider. Of course the ultimate test should be a GAE bill that reflects resources spent on both queries and updates and where alternatives stand against this bill.

The other point is that RIE should not play any part in the data model. Effectively it is an extension of indexes generated by datastore and should be treated as such.

And lastly there is clearly high initial cost of creating RIE index for datastore artifacts (like documents). I can recommend using task queues, mapreduce and/or blobstore and do as much pre-processing locally as possible before loading data into GAE. For example, document keywords can be extracted and processed on your own server before uploading them to datastore. That way you can end up with simple CSV file to process on GAE.

Thursday, September 1, 2011

Groovy. Batch. Prepared statement. Nice!

Scripting with Groovy is exciting thing. You get a feeling that the language was inspired by an Oracle who read your mind and then made it 10 times better. So imagine how I felt after finding out that Groovy can NOT batch prepared statements.

Batching sql updates, inserts or deletes is one of the top features that database scripts would need. Without prepared statements I had to resort to generating SQL with GString:
sql.withBatch { stmt ->
          mymap.each { k,v ->
              stmt.addBatch("""UPDATE some_table 
                                  SET some_column = '${v}' 
                                WHERE id = ${k} """)
          }
}
Besides SQL injection this presents the problem of escaping strings in SQL: big pain in some cases. By arguing that injection is not an issue for internal script (it's not in a wild on the web after all) you would leave yourself with the loop hole anyway. Don't forget about performance. The bottom line: I need support for prepared statements!

I would have to stop here joining ranks of complaints like this if not for Groovy 1.8.1. This latest stable version (as of today) addresses bunch of bugs and just couple of features. And one of two is batch support for prepared statements. Below is secure and reliable (as well as more readable) version with batch support for prepared statement in 1.8.1:
sql.withBatch(20, """UPDATE some_table 
                        SET some_column = ? 
                      WHERE id = ? """) { ps ->   
              
          mymap.each { k,v ->
              ps.addBatch(v, k)
          }
}

You can find more options on how use batching with prepared statements in Groovy 1.8.1 docs.