Sunday, June 19, 2011

Enhancing JUnit Suites with Categories to eliminate test-suite dependency

With JUnit 3 the extra step of creating and maintaining test suites on top of unit tests never felt right. With JUnit 4 test suites became just simple annotation boilerplate:
@RunWith(Suite.class)
@Suite.SuiteClasses({
  SomeTests.class,
  SomeOtherTests.class,
  SomethingElseTests.class
})
public class SomeTestSuite {
  // this class is just a place holder for test suite annotations above
}

The next logical step would be to aggregate tests based on information contained in the tests themselves. Instead of listing concrete test classes, a test suite would declare qualifiers (using annotations) to match. Tests with a matching qualifier are included in the suite; those without are not. For example, I have a qualifier JMSTest that is assigned to tests that use a JMS provider.

Thus, my test suite classes become completely decoupled from the tests and vice versa. This is even better than it sounds: even though there is no language-level dependency from tests to test suites in JUnit, a functional dependency does exist: tests will not run unless they are bound to one or more test suites.

With the introduction of JUnit Categories we received qualifier support that is almost what we need:

public interface JMSTest {}


@RunWith(Categories.class)
@Categories.IncludeCategory(JMSTest.class)
@Suite.SuiteClasses({
  SomeTestSuite.class,
})
public class JMSTestSuite {}


@Category(IntegrationTest.class)
public class SomeTests {
   ....
}


@Category(JMSTest.class)
public class SomeOtherTests {
   ....
}


@Category(DatabaseTest.class)
public class SomethingElseTests {
   ....
}

This setup runs SomeOtherTests, which is marked with the JMSTest category, when JMSTestSuite runs. But categories didn't eliminate the dependency from suites to tests – we still depend on the test suite SomeTestSuite, which explicitly references our tests.
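To make the selection rule concrete, here is a JDK-only sketch of include-style matching. It is a simplified model, not JUnit's implementation: the Qualifier annotation and the QualifierFilter class are hypothetical stand-ins for @Category and the Categories runner, and the test classes are empty placeholders. Matching on subtypes as well as the exact marker is a design choice of this sketch.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for JUnit's @Category, so the sketch needs only the JDK.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface Qualifier {
    Class<?>[] value();
}

// Marker interfaces, named as in the post.
interface IntegrationTest {}
interface JMSTest {}
interface DatabaseTest {}

// Empty placeholders for the test classes from the post.
@Qualifier(IntegrationTest.class) class SomeTests {}
@Qualifier(JMSTest.class) class SomeOtherTests {}
@Qualifier(DatabaseTest.class) class SomethingElseTests {}

final class QualifierFilter {
    private QualifierFilter() {}

    /** Returns the candidates that declare the wanted qualifier or a subtype of it. */
    static List<Class<?>> select(Class<?> wanted, List<Class<?>> candidates) {
        List<Class<?>> selected = new ArrayList<>();
        for (Class<?> candidate : candidates) {
            Qualifier qualifier = candidate.getAnnotation(Qualifier.class);
            if (qualifier == null) {
                continue; // classes without a qualifier never match
            }
            for (Class<?> declared : qualifier.value()) {
                if (wanted.isAssignableFrom(declared)) {
                    selected.add(candidate);
                    break;
                }
            }
        }
        return selected;
    }
}
```

Calling QualifierFilter.select(JMSTest.class, ...) with the three placeholder classes keeps only SomeOtherTests. Note that the list of candidates still has to come from somewhere, which is exactly the remaining dependency.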

Now, imagine you can define a suite AllProjectTests that always contains all tests from the project. Then you can define category-based test suites like JMSTestSuite above and never worry about maintaining your test suites again. Fortunately, an open source project does just that: ClasspathSuite.

import org.junit.extensions.cpsuite.ClasspathSuite;
import org.junit.runner.RunWith;
@RunWith(ClasspathSuite.class)
public class AllProjectTests {}


@RunWith(Categories.class)
@Categories.IncludeCategory(JMSTest.class)
@Suite.SuiteClasses({
  AllProjectTests.class,
})
public class JMSTestSuite {}
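For intuition about what a classpath-scanning suite has to do, here is a rough JDK-only sketch of test class discovery. It is not how ClasspathSuite is implemented: the TestClassScanner class is hypothetical, and selecting classes by a name suffix is an assumption made only for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TestClassScanner {
    private TestClassScanner() {}

    /**
     * Walks a directory of compiled classes and returns the fully qualified
     * names of top-level classes whose simple name ends with the given suffix.
     */
    static List<String> findClassNames(Path classesRoot, String suffix) {
        try (Stream<Path> paths = Files.walk(classesRoot)) {
            return paths
                .filter(Files::isRegularFile)
                .map(path -> classesRoot.relativize(path).toString())
                .filter(name -> name.endsWith(suffix + ".class"))
                .filter(name -> !name.contains("$")) // skip inner and anonymous classes
                .map(name -> name.substring(0, name.length() - ".class".length()))
                .map(name -> name.replace('\\', '.').replace('/', '.'))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resulting names could then be loaded with Class.forName and handed to a filter. The point is that the list of tests is derived from what is on disk, so nobody has to maintain it by hand.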

To summarize, if you want to define a test suite that runs all database-based tests:
1. Define AllProjectTests suite using ClasspathSuite.
2. Define JUnit category DatabaseTest.
3. Define corresponding suite DatabaseTestSuite to run all tests marked with category DatabaseTest.
4. More complex category-based suites are easy to construct with JUnit category support.

Wednesday, June 8, 2011

Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)

Extracting and processing text from multiple sources (file formats) is a job Apache Tika does quite well. It abstracts away format internals, and its coverage (PDF, MS Office, graphics, audio, video, etc.) is superb. Tika doesn't implement the actual parsers; instead it offers a uniform API to other parsers for the supported document types (all you need to do is implement a SAX ContentHandler - see here). Indeed, Tika uses PDFBox internally for PDF files. Nothing prevents you from concentrating on parsing the extracted text and turning it into information. Except when little details begin to matter.
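As an illustration of the handler side, here is a minimal sketch of a SAX ContentHandler that collects paragraph text in the order the events arrive. The ParagraphCollector class and its collect helper are hypothetical names, and the sketch assumes the parser reports each paragraph as a p element. The helper feeds the handler from an XML string with the JDK's own SAX parser, so it can be tried without Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Collects the text of each p element as one entry, in arrival order. */
final class ParagraphCollector extends DefaultHandler {
    private final List<String> paragraphs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();
    private boolean inParagraph;

    private static boolean isParagraph(String localName, String qName) {
        return "p".equals(localName) || "p".equals(qName);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (isParagraph(localName, qName)) {
            current.setLength(0);
            inParagraph = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inParagraph) {
            current.append(ch, start, length); // may be called several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (isParagraph(localName, qName)) {
            String text = current.toString().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
            inParagraph = false;
        }
    }

    List<String> paragraphs() {
        return paragraphs;
    }

    /** Helper for trying the handler without Tika: parses XML given as a string. */
    static List<String> collect(String xml) {
        ParagraphCollector handler = new ParagraphCollector();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
        return handler.paragraphs();
    }
}
```

A handler like this only sees events in the order the underlying parser emits them, which is why the ordering details of the PDF parser matter.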

Sometimes extracting and processing text depends on the order of lines in the document: for example, headers carry additional information for the line items below them. By default, the Tika parser (actually the PDFBox parser, in the case of PDF) does not preserve the order when stripping the text out (see org.apache.pdfbox.util.PDFTextStripper and its property sortByPosition). Thus, a header may reach the handler after its line items. Apache PDFBox explains that performance is better when order is not preserved. Tika as of version 0.9 doesn't let you control this behavior in PDFBox (they plan to address it in the 1.0 release - see TIKA-100 and TIKA-612).
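The effect can be shown with a toy model. This is not PDFBox's algorithm: TextFragment and ReadingOrder are hypothetical classes, and the coordinates are made up. The idea is only that text may be stored in the file in a different order than it appears on the page, so emitting it in stored order can put a header after its items, while sorting by position restores the visual order at the cost of extra work.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** A piece of text with the page coordinates of its top-left corner (y grows downward). */
final class TextFragment {
    final String text;
    final float x;
    final float y;

    TextFragment(String text, float x, float y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrder {
    private ReadingOrder() {}

    /** Returns the fragment texts top-to-bottom, then left-to-right; the input list is not modified. */
    static List<String> sortedByPosition(List<TextFragment> storedOrder) {
        List<TextFragment> copy = new ArrayList<>(storedOrder);
        copy.sort(Comparator
            .comparingDouble((TextFragment f) -> f.y)
            .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (TextFragment fragment : copy) {
            texts.add(fragment.text);
        }
        return texts;
    }
}
```

With fragments stored as "item 1", "Header", "item 2" but positioned with the header on top, a handler fed in stored order sees the item before its header; after sorting it sees the header first.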

A simple patch lets us take advantage of the sortByPosition property of PDFBox's PDFTextStripper when using it with Tika: lines 6-10 below replace the hard-coded call setSortByPosition(true or false). Remember that PDF2XHTML (from Tika) extends PDFTextStripper (from PDFBox):
private PDF2XHTML(ContentHandler handler, Metadata metadata)
        throws IOException {
    this.handler = new XHTMLContentHandler(handler, metadata);
    setForceParsing(true);

    // CUSTOM CODE:
    String sortEnabled = metadata.get("org.apache.tika.parser.pdf.sortbyposition");
    if (sortEnabled != null) {
        setSortByPosition(sortEnabled.equalsIgnoreCase("true"));
    }
}
Now you can control the order in the PDF parser as follows (there are no PDFBox classes in this code):
        
        Parser parser = new PDFParser();
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        metadata.set("org.apache.tika.parser.pdf.sortbyposition", "false");
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }

Of course, if you deal with only a single file type (e.g. PDF), it's easier to use a dedicated library such as Apache PDFBox. In that case I recommend downloading the Tika source code for real examples of PDFBox in action.

UPDATE
The issue is still present in Tika 1.0, with a hard-coded setSortByPosition(false); in org.apache.tika.parser.pdf.PDF2XHTML.

The upcoming 1.1 release will add a PDFParser.setSortByPosition method, so the patch can be replaced with this:
      
        // Declared as PDFParser: setSortByPosition is not part of the Parser interface.
        PDFParser parser = new PDFParser();
        parser.setSortByPosition(true); // or false
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }