
Creating templates for reading and processing text PDFs with Python RPA – Meet BotCity Documents – Part 1

In many areas we have to deal with different types of PDF documents, whether they contain text or images. With Python RPA, processing these documents is much easier than with regex, XML, or other more complex procedures, and it also works alongside other OCR tools or frameworks.

In this article, we will show you one of our frameworks: BotCity Documents. It is integrated with BotCity Studio and makes it easier to read PDF documents, individually or in batches. It uses computer vision to turn the relationship between fields and values in a document into Python code that you can add to your RPA automation project.

In this part 1 of the BotCity Documents series, you will see how to process text PDF documents. In part 2 we will demonstrate how to process PDF documents with images.

💡 Important: BotCity Documents is currently not available in the community license. But expect news soon. 👀

How to use BotCity Documents

Let’s go step by step to understand how to use the framework. For this example we will use a desktop bot, but you can also use a web bot. To create a similar template, you can follow the instructions in this link from our documentation.

However, we emphasize that BotCity Documents is not exclusive to automation projects built with our Desktop or Web frameworks. It is a standalone package: you can use it together with other tools, and you can also use it in other projects without depending on those frameworks.

Prerequisites

  1. Have a license other than Community.
  2. Install the BotCity Studio SDK. You can check how to do this by clicking on this link in our documentation.
  3. Install the BotCity Documents package. To do this, contact us for guidance on how to install the package according to your license (which, for now, cannot be the Community version).
  4. Have an IDE installed, for example Visual Studio Code or PyCharm.

Example: reading PDF text files

In your IDE:

Open the IDE you already use day to day. In your project, open the .py file in which you will develop, or create a new one following your project's organization and patterns. In this example, we created the file “bills.py”.

Import the installed package as indicated in the third item of the prerequisites.

Then instantiate the PDF reader and read your file, as per the example below in Python code.

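# Importing the reader class (assumed module path; confirm it against the
# BotCity Documents version installed for your license)
from botcity.document_processing import PDFReader

# Creating a reader
reader = PDFReader()

# Reading the pdf file
parser = reader.read_file("document.pdf")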

You can also follow the documentation and see how to do the same in Java, if you are using that language. Check this link.

In BotCity Studio:

Open BotCity Studio and create or load your project. Click “File” and then “New Project” for a new project, or “Load Project” to load an existing one.

Locate your project folder and, within it, select the file with the “.botproj” extension. This way the same code is available in both tools: your IDE (Visual Studio Code in this example, though any IDE works) and BotCity Studio.

From now on, it is important to keep the cursor in BotCity Studio positioned after the last line of code, because the code that the tool generates automatically will be inserted at that position.

At this point we need to load the PDF that we will use to create the template. Click on “Documents” in the top menu. After that, click “Load new document” and identify the PDF file to be read.

After loading the PDF, you will have the “Code” tab, where the code will be; the “UI” tab, which we are not using in this project; and the “PDF” tab, where the file you have loaded will be.

Click on the “PDF” tab. Identify the fields, and their respective values, that your files will have by default, and select them. Each selection is made with two pieces of information at a time: the first is the field, highlighted in red in the BotCity Studio interface, and the second is the value of that field, highlighted in blue.

Analyzing the generated code for its use:

_account_no = parser.get_first_entry("Account No:")
value = parser.read(_account_no, 1.078947, -2.25, 1.513158, 3.5)
_statement_date = parser.get_first_entry("Statement Date:")
value = parser.read(_statement_date, 1.06, -2, 1.18, 3.5)
_due_date = parser.get_first_entry("Due Date:")
value = parser.read(_due_date, 1.079365, -1.75, 1.920635, 3.5)

BotCity Studio uses what it reads in the document to generate the code. For example, for the field called “Account No” in the PDF, it infers that the label can serve as the variable name and therefore names it “_account_no”. The same happened with the “Statement Date” and “Due Date” fields.

Let’s validate that each reading returns the right value. We will add a print command for each field and compare the result to make sure the data is being read correctly.

The code could look like this:

_account_no = parser.get_first_entry("Account No:")
value = parser.read(_account_no, 1.078947, -2.25, 1.513158, 3.5)
print(f"Account no: {value}")

_statement_date = parser.get_first_entry("Statement Date:")
value = parser.read(_statement_date, 1.06, -2, 1.18, 3.5)
print(f"Statement Date: {value}")

_due_date = parser.get_first_entry("Due Date:")
value = parser.read(_due_date, 1.079365, -1.75, 1.920635, 3.5)
print(f"Due Date: {value}")

By running the code from our “bills.py” file, we can compare the result with the PDF.
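
The generated snippet reuses a single `value` variable, so each read overwrites the previous one. As a minimal sketch, assuming you want to keep every field, the same two calls from the generated code (`get_first_entry` and `read`) can be driven from a table. `FIELDS` and `read_fields` are hypothetical names introduced here for illustration; the numbers are copied from the generated code above.

```python
# Hypothetical field table: label text -> the four numbers generated by
# BotCity Studio for that field (copied from the snippet above).
FIELDS = {
    "Account No:": (1.078947, -2.25, 1.513158, 3.5),
    "Statement Date:": (1.06, -2, 1.18, 3.5),
    "Due Date:": (1.079365, -1.75, 1.920635, 3.5),
}


def read_fields(parser, fields=FIELDS):
    """Return {field name: value} using only the two parser calls shown above."""
    results = {}
    for label, numbers in fields.items():
        anchor = parser.get_first_entry(label)  # locate the field label
        results[label.rstrip(":")] = parser.read(anchor, *numbers)  # read its value
    return results
```

With the `parser` created earlier, `read_fields(parser)` would return a dictionary such as `{"Account No": ..., "Statement Date": ..., "Due Date": ...}` that you can print or pass on to the rest of your automation.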

💡 Important tips:

  • After making the selections you need in the PDF, go to the “Code” tab and press “Ctrl + S” on your keyboard to save the newly generated code. This will make the code appear automatically in your IDE.
  • If you change something in the code from your IDE, when you save the change and click again in BotCity Studio, it will receive the change you made.
  • Even if the mapped field and its value change location within the document, the tool can still read it, as it is mapped with computer vision, as long as the relationship between the anchor and the reading area is maintained.
  • BotCity Studio generates the code assuming that the variable holding the result of reading the PDF is named parser. If you used another name, simply rename it in the generated code.
  • If you do not want to see the outlines of what you are selecting, clear the “Show Selections” checkbox. When it is checked (✅), the selections are shown; when it is unchecked, they are hidden.
  • If you make a mistake in the first selection (the field, highlighted in red), press the ESC key on your keyboard. That single selection will be erased and you will be able to make it again.
  • If you make many mistakes, you can delete the code snippet generated in the “Code” tab and reselect what you need.
  • Note that in the demo GIF, the field selection is snapped to the word, while the value selection covers a wider area of the PDF. This is because the field label is fixed, but the value can be longer than in the sample document we used. The wider selection helps ensure that the full value is captured.
  • You can also follow this quick video on our YouTube channel demonstrating the use. Subscribe now and receive news there.
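
To picture the tip above about fields that change location, here is a minimal sketch of anchor-relative positioning. It assumes, purely for illustration, that the four numbers passed to `parser.read` are offsets and sizes expressed as multiples of the anchor's width and height. This article does not spell that out, and `Box` and `region_from_anchor` are hypothetical names, not part of BotCity Documents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """A rectangle on the page (hypothetical helper type)."""
    x: float
    y: float
    width: float
    height: float


def region_from_anchor(anchor, x_ratio, y_ratio, w_ratio, h_ratio):
    """Place a reading area relative to an anchor box.

    Assumption for illustration: offsets and sizes are multiples of the
    anchor's own width and height.
    """
    return Box(
        x=anchor.x + x_ratio * anchor.width,
        y=anchor.y + y_ratio * anchor.height,
        width=w_ratio * anchor.width,
        height=h_ratio * anchor.height,
    )
```

Under this model, if the anchor label moves to another spot on the page, the reading area moves by the same amount and keeps its size, which is one way to understand why a relocated field can still be read.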

And for batch execution?

Python itself has libraries that can help with discovering PDF files, such as pathlib or glob. You can also use the language's own features to build structures that run your processing in a loop over several files, according to the needs of your project.
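
As a minimal sketch of that idea, the functions below use pathlib to find the PDFs in a folder and call a processing function on each one. `find_pdfs`, `process_batch`, and `process_one` are hypothetical names introduced for illustration; what `process_one` does with each file is up to your project.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def process_batch(folder, process_one):
    """Run `process_one(path)` on every PDF found in `folder`.

    Returns two dicts keyed by file name: the successful results and the
    exceptions raised by documents that could not be processed.
    """
    results, failures = {}, {}
    for pdf_path in find_pdfs(folder):
        try:
            results[pdf_path.name] = process_one(pdf_path)
        except Exception as error:  # one bad file should not stop the batch
            failures[pdf_path.name] = error
    return results, failures
```

Here `process_one` could be a function of your own that calls `reader.read_file(str(path))` and then runs the field reads generated by BotCity Studio. Collecting failures separately lets the loop finish even when one document does not match the template.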

What do you think?

Did you like this feature? If you have any questions, don’t hesitate to contact us, and don’t forget to join our community, to enjoy the exchange of knowledge and experiences.

She/her. I am a Tech Writer and Developer Relations at BotCity. I am also a tech content creator who loves tech communities and people.
