I will use GAE/Java with Objectify for datastore API (also see my newer post with Python implementation). Assume we maintain a document library where each document has several textual attributes: name, title, subtitle, authors, publisher, reference number (similar to ISBN), tags, abstract, etc. While each attribute is semantically different, for a searcher they all present some value (or relevance). Thus, user may search for any of them with one or more keywords. For simplification, I consider only AND searches.
First, let’s model our entities (remember, we use Objectify that in turn uses standard JPA annotations wherever possible):
@Entity(name = "Document") public class Document { @Id private Long id; private String title; private List<String> authors = = new ArrayList<String>(); private String publisher; private List<String> tags = new ArrayList<String>(); // more attributes as necessary... public Document() { super(); } // standard getters/setters follow... }
One thing to emphasize is the use of list properties such as authors and tags. Datastore treats them as multi-valued attributes so that condition like authors == ‘John Doe’ would return all documents that have John doe as one of authors. This list property feature is critical in the next (and last) entity we define:
@Entity(name = "DocumentKeywords") public class DocumentKeywords { @Id Long id; @Parent Key<Document> document; List<String> keywords = new ArrayList<String>(); private DocumentKeywords() { super(); } public DocumentKeywords (Key<Document> parent) { this(parent, Collections.<string>emptyList()); } public DocumentKeywords (Key<Document> parent, Collection<String> keywords) { super(); this. document = parent; this.keywords.addAll(keywords); } // add single keyword public boolean add(String keyword) { return keywords.add(keyword); } // add collection of keywords public boolean add(Collection<String> keywords) { return this.keywords.addAll(keywords); } }
There are several things worth noting about DocumentKeywords.
First, it’s a child entity to Document (see @Parent annotation in Objectify). Parent Document and child DocumentKeywords make an entity group in datastore. This is important for data integrity – entity group rows can participate in transactions in datastore. Data integrity is critical in this case (you'll see shortly). Indeed, we'll duplicate attribute values between Document and DocumentKeywords. For each Document entity we create corresponding child DocumentKeywords to consolidate all document attributes into property keywords.
Secondly, keywords is a list property. List property is limited to 5000 entries which is often sufficient. And if it’s not we could add more DocumentKeywords child rows for the same Document parent (not implemented here).
Finally, what is DocumentKeywords entity defined for? Why is its keywords attribute not part of Document entity? The answer is in this Google IO presentation (Spoiler: Keywords being list property in Document would produce serialization overhead on Document entity (at least doubling it since it's exact copy of the rest of Document attributes). Moving keywords to separate entity is called Relation Index Entity and it gives us best of both worlds: fully indexed attributes (via list property) and no serialization overhead for documents.)
We add new Document and index document attributes in child DocumentKeywords in one transaction:
// we use DI to initialize factory (application scoped) private final ObjectifyFactory factory; private Objectify tran = null; public Document addDocument(Document document) throws { try { Objectify ofy = beginTransaction(); Key<Document> key = ofy.put(document); // crate Relation Index Entity DocumentKeywords rie = new DocumentKeywords(key); rie.add(document.getTitle()); rie.add(document.getAuthors()); rie.add(document.getPublisher()); rie.add(document.getTags()); ofy.put(rie); commit(); Document savedDocument = beginQuery().find(key); return savedDocument; }finally { rollbackIfActive(); } }
I left out as an exercise transactional methods above. Just note important datastore gotcha: rows added inside transaction are not available within this transaction if it’s still active. If you add row and then try to query it before committing then you won’t find it. Commit transaction first and then read its data.
Now we are ready to make actual keyword searches:
public Collection<document> findByKeywords(Collection<String> keywords) { Objectify ofy = beginQuery(); Query<DocumentKeywords> query = ofy.query(DocumentKeywords.class); for (String keyword : keywords) { query = query.filter("keywords", keyword); } Set<Key<Document>> keys = query.<Document>fetchParentKeys(); Collection<Document> documents = ofy.get(keys).values(); return documents; }
You can see that keyword search is a 3-step process: during first step we iteratively build search condition (AND only), in second step we query DocumentKeywrods toretrieve keys only – no overhead of serialization bulky keywords here. And lastly we convert retrieved DocumentKeywords keys into parent keys (documents) and use datastore batch get to return them. Objectify made all steps quite transparent and efficient.
This is all to it. Let me make few comments about this example. It is purposely contrived but it should map to real cases with no principal changes. Documents could be friends in a social network, products from online retail catalog, or blog entries in blogging web site. I intentionally left document content out of the list of attributes. Current limit of datastore doesn’t allow me to build elegant and concise solution beyond 5000 thousand keywords per document so it makes inclusion of document content risky. Even though simple enhancement trounces this limitation I didn’t want to overload code above.
Extending to free text search would mean support for such features as word normalization and stemming, case-sensitivity, logical operations, keyword proximity (e.g. same attribute or related attributes), extending beyond datastore 5000 list property limit.
References:
1. Building Scalable, Complex Apps on App Engine
2. Datastore List Property
3. Stemming
4. Objectify
5. RIE with Python