Document Processing

No more Regex or XML conversion to parse PDF Documents

Discover a new approach that mimics the way humans read documents

When creating parsers to read PDF documents automatically, two approaches are very common: writing regex rules, or converting the document to a structured format such as XML and parsing that. In both cases, you need to work out specific rules (regex patterns or XML lookups) for each field in the document.

Let’s see an example of parsing some document fields using regex:
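As a minimal sketch of the regex approach, the snippet below parses a few fields from text that is assumed to have already been extracted from a PDF. The sample invoice text, the field names, and the `parse_invoice` helper are all hypothetical and only illustrate the idea of one hand-written rule per field.

```python
import re

# Hypothetical text extracted from an invoice PDF (sample data for illustration).
INVOICE_TEXT = """\
Invoice Number: INV-00123
Date: 2021-08-30
Customer: ACME Corp
Total: $1,250.00
"""

# One hand-written rule per field: every new field needs its own pattern.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"Invoice Number:\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "customer": re.compile(r"Customer:\s*(.+)"),
    "total": re.compile(r"Total:\s*\$([\d,]+\.\d{2})"),
}


def parse_invoice(text):
    """Return a dict of field name -> captured value (None when a rule fails)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1).strip() if match else None
    return result


fields = parse_invoice(INVOICE_TEXT)
print(fields)
```

Note how tightly each pattern is tied to the exact wording and format of the sample: if the label became "Invoice No." or the date format changed, the corresponding rule would silently return `None`.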

Now, parsing an XML document:
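The sketch below shows what this can look like, assuming a PDF-to-XML converter that emits one `<text>` element per fragment with `left` and `top` coordinates. The XML layout, the coordinates, and the `value_at` helper are assumptions made for illustration, not the output of any specific tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML produced by a PDF-to-XML converter (structure is assumed).
INVOICE_XML = """\
<page number="1">
  <text left="50" top="40">Invoice Number:</text>
  <text left="180" top="40">INV-00123</text>
  <text left="50" top="60">Total:</text>
  <text left="180" top="60">$1,250.00</text>
</page>
"""


def value_at(root, left, top):
    """Return the text of the element at an exact (left, top) position, or None."""
    for node in root.iter("text"):
        if node.get("left") == str(left) and node.get("top") == str(top):
            return node.text
    return None


root = ET.fromstring(INVOICE_XML)

# Each field is located by hard-coded coordinates taken from one sample document.
invoice_number = value_at(root, 180, 40)
total = value_at(root, 180, 60)
print(invoice_number, total)
```

Here every field depends on a fixed position, so a value that moves by even a few units is no longer found.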

As you can see, developing the parser can be very laborious, depending on the number of fields in the document.

Moreover, both approaches are very sensitive to changes in the document, such as omitting a field or changing its position. Even if the change looks minimal when you view the document, it might break the parser, because the parser is not based on the document's visual structure.

Now, let's take a look at this problem from a different perspective. Why can humans still read a document even if fields are moved or changed? The answer is pretty simple: humans don't rely on the position of the fields in the document. Instead, we usually look for a relationship between a label and its value:

In red we have the labels, which define the field in question, and in blue we have the values. Fields (labels and values) are usually grouped by context to make reading easier, but if we change the position of the fields in the document, humans can still understand it without any trouble.
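To make the label-and-value idea concrete, here is a minimal sketch that finds a value by looking for the nearest text fragment to the right of its label, wherever that label happens to be on the page. This is not BotCity's API or implementation: the `Word` class, the `read_right_of` helper, and the sample layouts are hypothetical and only illustrate the concept.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A text fragment with the position of its top-left corner (hypothetical model)."""
    text: str
    x: float
    y: float


def read_right_of(words, label, y_tolerance=5.0):
    """Return the nearest fragment to the right of `label` on the same line, or None."""
    anchor = next((w for w in words if w.text == label), None)
    if anchor is None:
        return None
    candidates = [
        w for w in words
        if w is not anchor
        and abs(w.y - anchor.y) <= y_tolerance  # roughly the same line
        and w.x > anchor.x                      # to the right of the label
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda w: w.x).text


# The same two fields in two different layouts (sample data).
layout_a = [
    Word("Customer:", 50, 40), Word("ACME Corp", 180, 40),
    Word("Total:", 50, 60), Word("$1,250.00", 180, 60),
]
layout_b = [
    Word("Total:", 50, 20), Word("$1,250.00", 150, 22),
    Word("Customer:", 300, 200), Word("ACME Corp", 420, 201),
]

for layout in (layout_a, layout_b):
    print(read_right_of(layout, "Customer:"), read_right_of(layout, "Total:"))
```

Because the lookup is anchored on the label rather than on absolute coordinates, both layouts yield the same values even though every field has moved.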

What if it were possible to use the same concept when creating parsers to read documents automatically? What if there were a tool that generated the parser code automatically as you click on the labels and values in the document?

Let’s talk about BotCity Documents

BotCity Documents is a framework that lets you easily create parsers and read documents using Python or Java, in the same way you would naturally read a document: by establishing a relationship between labels and values.

With BotCity Studio's intuitive interface and automatic code generation, alongside the BotCity Documents framework for document parsing, generating the code to parse a given field in the document is simple:

Step 1 – Select the field in the document

Step 3 – Code is generated automatically

This process is repeated for each field you need to read, and your custom parser is built in minutes.

BotCity plugins integrate seamlessly with your favorite OCR provider, such as Google Cloud Vision, Azure Cognitive Services, or the open-source project Tesseract. With them, BotCity Documents can transparently handle not only text-based PDFs but also scanned PDFs and image files using the same codebase.

All this means fewer headaches creating multiple readers, parsers, and integrations with third-party services.

Take a look at BotCity Documents in action and see how you can boost your team's productivity by building parsers that are not only faster to create but also more maintainable and reliable.

Head of Developer Experience @ BotCity.
