Wednesday, June 8, 2011

Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood)

Extracting and processing text from many different file formats is a job Apache Tika does quite well. It abstracts you away from format internals, and its coverage (PDF, MS Office, graphics, audio, video, etc.) is superb. Tika generally doesn't implement the actual parsers itself - instead it offers a uniform API over existing parsers for the supported document types (all you need to do is implement a SAX ContentHandler - see here). For PDF files, Tika uses PDFBox internally. This leaves you free to concentrate on parsing the extracted text and turning it into information. Except when little details begin to matter.

Sometimes extracting and processing text depends on the order of lines in the document: for example, headers carry additional information for the line items below them. By default, the Tika parser (actually the PDFBox parser that does the work for PDF) emits text in the order it is stored in the file, which is not necessarily the order you see on the page (see org.apache.pdfbox.util.PDFTextStripper and its sortByPosition property). As a result, a header may reach the handler after its line items have already been fed to it. The PDFBox documentation explains that performance is better when order is not preserved. As of version 0.9, Tika doesn't let you control this behavior in PDFBox (they plan to address it in the 1.0 release - see TIKA-100 and TIKA-612).
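
To make the ordering problem concrete, here is a small illustration in plain Java. It is not PDFBox's actual algorithm, and PositionSortDemo and Chunk are made-up names used only for this sketch. It assumes each text run has a page position, with y growing downward, and shows how runs stored in one order can come out in another once sorted by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PositionSortDemo {

    /** Hypothetical stand-in for a positioned text run; not a PDFBox class. */
    static final class Chunk {
        final float x;
        final float y;
        final String text;

        Chunk(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the chunk texts ordered top-to-bottom, then left-to-right. */
    static List<String> sortedByPosition(List<Chunk> storedOrder) {
        // Copy first so the caller's list keeps its stored order.
        List<Chunk> copy = new ArrayList<>(storedOrder);
        copy.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                            .thenComparingDouble(c -> c.x));
        List<String> result = new ArrayList<>();
        for (Chunk c : copy) {
            result.add(c.text);
        }
        return result;
    }
}
```

If the file stores the line items first and the header last, a handler that reads the stored order sees the header too late, while the position-sorted order puts it back on top.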

A simple patch lets us take advantage of the sortByPosition property of PDFBox's PDFTextStripper when using it through Tika. Lines 6-10 below replace the hard-coded setSortByPosition(true or false); call, and remember that PDF2XHTML (from Tika) extends PDFTextStripper (from PDFBox):
private PDF2XHTML(ContentHandler handler, Metadata metadata)
            throws IOException {
        this.handler = new XHTMLContentHandler(handler, metadata);
        setForceParsing(true);
        
        // CUSTOM CODE:
        String sortEnabled = metadata.get("org.apache.tika.parser.pdf.sortbyposition");
        if (sortEnabled != null) {
            setSortByPosition(sortEnabled.equalsIgnoreCase("true"));
        }
}
Now you can control the order in the PDF parser as follows (note that there are no PDFBox classes in this code):
        
        Parser parser = new PDFParser();
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        // "true" sorts text by position; "false" keeps the PDFBox default (no sorting)
        metadata.set("org.apache.tika.parser.pdf.sortbyposition", "false");
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }
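
The myHandler passed to BodyContentHandler above is not shown in this post. As a minimal sketch, assuming all you need is the extracted text split into lines, it could look something like the class below. The name LineCollectingHandler and its getLines method are hypothetical and not part of Tika; only org.xml.sax classes from the JDK are used. Treating every closing element as a line break is a simplification of mine, not something Tika requires.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical stand-in for myHandler: accumulates character data as lines. */
final class LineCollectingHandler extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver text in several pieces, so always append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: a closing element ends a line, so that text from
        // adjacent elements does not run together.
        text.append('\n');
    }

    /** Returns the non-blank lines collected so far, trimmed. */
    public List<String> getLines() {
        List<String> lines = new ArrayList<>();
        for (String line : text.toString().split("\\r?\\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }
}
```

With this in place, myHandler would be a new LineCollectingHandler(), and after parse returns, getLines() holds the text in the order the parser emitted it.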

Of course, if you deal with only a single file type (e.g. PDF), it's easier to use a dedicated library such as Apache PDFBox. In that case, my recommendation would be to download the Tika source code for real examples of PDFBox in action.

UPDATE
The issue is still present in Tika 1.0, which hard-codes setSortByPosition(false); in org.apache.tika.parser.pdf.PDF2XHTML.

The upcoming 1.1 release will add a PDFParser.setSortByPosition method, so the patch can be replaced with this:
      
        // Declared as PDFParser: setSortByPosition is not part of the Parser interface
        PDFParser parser = new PDFParser();
        parser.setSortByPosition(true); // or false
        ContentHandler handler = new BodyContentHandler(myHandler);
        Metadata metadata = new Metadata();
        ParseContext context = new ParseContext();

        try {
            parser.parse(instream, handler, metadata, context);
        } finally {
            instream.close();
        }

4 comments:

Ben said...

Thank you.

Anonymous said...

Good post!!
I would be grateful, though, if you could explain some of the variables used, like myHandler in the second script. Is it the same handler as in the first one? Also, I don't understand how the classes are related.
Thanks

Gregory Kanevsky said...

BodyContentHandler is just a decorator class in Tika - it passes everything inside the XHTML body tag to the underlying handler, myHandler in this case. Thus, all the text that PDFParser extracts from your PDF document reaches myHandler via its callbacks (defined in org.xml.sax.ContentHandler, which myHandler must implement).

Yes, the handlers are the same. The second piece of code is an example of using Tika to parse a PDF document. After the parser's parse method completes (in our case the parser is Tika's PDFParser), myHandler contains all the relevant text from the PDF (it's your responsibility to recognize and accumulate it by overriding the ContentHandler callback methods I mentioned before).

abc said...

Do you know how to convert DOCX to HTML using Tika?