Because best way to understand is explaining.

Tuesday, February 22, 2011

Efficient Keyword Search with Relation Index Entities and Objectify for Google Datastore

Free text search with keywords on Google App Engine datastore made simple - in fact simple enough to fit into single blog entry.

I will use GAE/Java with Objectify for datastore API (also see my newer post with Python implementation). Assume we maintain a document library where each document has several textual attributes: name, title, subtitle, authors, publisher, reference number (similar to ISBN), tags, abstract, etc. While each attribute is semantically different, for a searcher they all present some value (or relevance). Thus, user may search for any of them with one or more keywords. For simplification, I consider only AND searches.

First, let’s model our entities (remember, we use Objectify that in turn uses standard JPA annotations wherever possible):

@Entity(name = "Document")
public class Document {

  @Id
  private Long id;

  private String title;
  private List<String> authors = = new ArrayList<String>();
  private String publisher;
  private List<String> tags = new ArrayList<String>();
  // more attributes as necessary...

  public Document() {
    super();
  }

  // standard getters/setters follow...
}


One thing to emphasize is the use of list properties such as authors and tags. Datastore treats them as multi-valued attributes so that condition like authors == ‘John Doe’ would return all documents that have John doe as one of authors. This list property feature is critical in the next (and last) entity we define:

@Entity(name = "DocumentKeywords")
public class DocumentKeywords {

  @Id Long id;
  @Parent Key<Document> document;
  List<String> keywords = new ArrayList<String>();

  private DocumentKeywords() {
    super();
  }

  public DocumentKeywords (Key<Document> parent) {
    this(parent, Collections.<string>emptyList());
  }

  public DocumentKeywords (Key<Document> parent, Collection<String> keywords) {
    super();

    this. document = parent;
    this.keywords.addAll(keywords);
  }

  // add single keyword
  public boolean add(String keyword) {
    return keywords.add(keyword);
  }

  // add collection of keywords
  public boolean add(Collection<String> keywords) {
    return this.keywords.addAll(keywords);
  }
}


There are several things worth noting about DocumentKeywords.

First, it’s a child entity to Document (see @Parent annotation in Objectify). Parent Document and child DocumentKeywords make an entity group in datastore. This is important for data integrity – entity group rows can participate in transactions in datastore. Data integrity is critical in this case (you'll see shortly). Indeed, we'll duplicate attribute values between Document and DocumentKeywords. For each Document entity we create corresponding child DocumentKeywords to consolidate all document attributes into property keywords.

Secondly, keywords is a list property. List property is limited to 5000 entries which is often sufficient. And if it’s not we could add more DocumentKeywords child rows for the same Document parent (not implemented here).

Finally, what is DocumentKeywords entity defined for? Why is its keywords attribute not part of Document entity? The answer is in this Google IO presentation (Spoiler: Keywords being list property in Document would produce serialization overhead on Document entity (at least doubling it since it's exact copy of the rest of Document attributes). Moving keywords to separate entity is called Relation Index Entity and it gives us best of both worlds: fully indexed attributes (via list property) and no serialization overhead for documents.)

We add new Document and index document attributes in child DocumentKeywords in one transaction:

// we use DI to initialize factory (application scoped)
private final ObjectifyFactory factory;

private Objectify tran = null;

public Document addDocument(Document document) throws {

  try {
    Objectify ofy = beginTransaction();

    Key<Document> key = ofy.put(document);

    // crate Relation Index Entity
    DocumentKeywords rie = new DocumentKeywords(key);
    rie.add(document.getTitle());
    rie.add(document.getAuthors());
    rie.add(document.getPublisher());
    rie.add(document.getTags());
    ofy.put(rie);

    commit();

    Document savedDocument = beginQuery().find(key);
    return savedDocument;

  }finally {
    rollbackIfActive();
  }
}

I left out as an exercise transactional methods above. Just note important datastore gotcha: rows added inside transaction are not available within this transaction if it’s still active. If you add row and then try to query it before committing then you won’t find it. Commit transaction first and then read its data.

Now we are ready to make actual keyword searches:

public Collection<document> findByKeywords(Collection<String> keywords) {

  Objectify ofy = beginQuery();

  Query<DocumentKeywords> query = ofy.query(DocumentKeywords.class);
  for (String keyword : keywords) {
    query = query.filter("keywords", keyword);
  }

  Set<Key<Document>> keys = query.<Document>fetchParentKeys();

  Collection<Document> documents = ofy.get(keys).values();

  return documents;
}


You can see that keyword search is a 3-step process: during first step we iteratively build search condition (AND only), in second step we query DocumentKeywrods toretrieve keys only – no overhead of serialization bulky keywords here. And lastly we convert retrieved DocumentKeywords keys into parent keys (documents) and use datastore batch get to return them. Objectify made all steps quite transparent and efficient.

This is all to it. Let me make few comments about this example. It is purposely contrived but it should map to real cases with no principal changes. Documents could be friends in a social network, products from online retail catalog, or blog entries in blogging web site. I intentionally left document content out of the list of attributes. Current limit of datastore doesn’t allow me to build elegant and concise solution beyond 5000 thousand keywords per document so it makes inclusion of document content risky. Even though simple enhancement trounces this limitation I didn’t want to overload code above.

Extending to free text search would mean support for such features as word normalization and stemming, case-sensitivity, logical operations, keyword proximity (e.g. same attribute or related attributes), extending beyond datastore 5000 list property limit.

References:
1. Building Scalable, Complex Apps on App Engine
2. Datastore List Property
3. Stemming
4. Objectify
5. RIE with Python

7 comments:

turbomanage said...

Great post, Gregory. I think the concept is exactly right, and the sample code clear and concise. Thanks for sharing.

Thomvis said...

Nice article! I implemented something very similar and was planning on blogging about it. I've a question: does your solution allow prefix search? In other words: does query.filter("keywords >", "Objec").filter("keywords <", "Objec" + '\uFFFD') work?

Gregory Kanevsky said...

Thomvis, list property with multiple inequalities would work for prefix value. Yes, I just tested it and it worked for single prefix. Which means to implement keyword prefix search there will be N queries for N keywords with subsequent merge in memory. I wonder if Objectify has something similar to Twig-Persist parallel async commands...

Gregory Kanevsky said...

Also, to match prefix with empty suffix condition becomes
query.filter("keywords >=", "Objec").filter("keywords <", "Objec" + '\uFFFD')

erolagnab said...

thanks for the clear explanation.

I'm wondering what would happen if we save a modified Document? Would a new DocumentKeywords will be inserted? How could we delete the old DocumentKeywords?

Gregory Kanevsky said...

erolagnab, thanks and good question. I'd treat update to Document as delete and insert for its keywords. There is clearly some room for optimization but for most cases this approach should be ok.

Anonymous said...

I like the way search functionality has been implemented. Perhaps to add a tip - if you would like to use the same documentkeywords class (I would call it SearchKeywords to be generic) for various other classes - books, cars, etc. perhaps we can add Class.name as one of the keywords to it. This way if you want to search in SearchKeywords.

Only if we had IN