Strategic information for data-driven decision-making in law and intellectual property is most often stored in PDF documents. Examples include which judge decided a lawsuit and why a claim was rejected or, for patents, which examiners signed a technical examination report or decision, what their reasons were, and which articles were cited as the basis for rejecting the application.
This information is usually unstructured, and a simple OCR pass is often not enough. Many AI-based APIs can now extract information in a structured way. Form-aware APIs are one example: they can extract a table from a PDF document as a table, with its rows and columns preserved. There are several solutions on the market. The two I’ve had the opportunity to test are Google Document AI and Azure Form Recognizer.
Let’s take a look at the pros and cons of each option to help you decide.
Google Document AI Pros:
- integrates with Google Drive, making it easy to use for businesses that already use Google products
- offers a free tier with limited features for businesses on a budget
- an easy-to-use interface makes it quick to get started with little training required
Google Document AI Cons:
- lacks some of the more advanced features offered by competitors, making it less suitable for businesses with complex needs
- not as widely used as some competitors, making it harder to find support and resources if you encounter problems
- pricing can be expensive for businesses that need more than the free tier offers
Azure Form Recognizer Pros:
- offers more advanced features than Google Document AI, making it better suited for businesses with complex needs
- widely used, meaning there’s plenty of support and resources available if you encounter problems
- pricing is based on usage, so you only pay for what you need
Azure Form Recognizer Cons:
- not as easy to use as Google Document AI, so employees may need more training
- doesn’t integrate with other Microsoft products as seamlessly as Google Document AI integrates with Google products
I tested the Azure Form Recognizer API on a patent technical examination report downloaded from the Brazilian Patent and Trademark Office (BRPTO). These documents normally follow the format below. If you want to see the file in full, click here.
If we simply run OCR on these tables, the data looks like this:
“Quadro 2 – Considerações referentes aos Artigos 10, 18, 22 e 32 da Lei n.o 9.279 de 14 demaio de 1996 – LPI Artigos da LPISim NãoA matéria enquadra-se no art. 10 da LPI (não se considera invenção)XA matéria enquadra-se no art. 18 da LPI (não é patenteável)XO pedido apresenta Unidade de Invenção (art. 22 da LPI)XO pedido está de acordo com disposto no art. 32 da LPIXComentários/Justificativas”
“Quadro 3 – Considerações referentes aos Artigos 24 e 25 da LPIArtigos da LPISim NãoO relatório descritivo está de acordo com disposto no art. 24 da LPIXO quadro reivindicatório está de acordo com disposto no art. 25 da LPIX”
From this flattened text, we cannot efficiently or accurately identify which options were marked in the tables, because each X has lost the column (Sim or Não) it belonged to. A better solution is to use an API that recognizes tables, as shown below:
You can see that the columns in the tables are recognized perfectly, and we extracted the data exactly as it is in the table converted to JSON format.
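To make the JSON step more concrete, here is a minimal Python sketch that rebuilds the rows of a recognized table and reads off which column holds the X. It assumes a layout-style output in which each table carries `rowCount`, `columnCount`, and a list of `cells` with `rowIndex`, `columnIndex`, and `content`. Field names can vary between API versions, so check them against your own JSON file. The function names `table_to_grid` and `checked_answers` are hypothetical, and the sample data is made up for illustration (modeled on Quadro 3), not the real API response.

```python
def table_to_grid(table):
    """Rebuild a row-by-column grid from a flat list of recognized cells."""
    grid = [["" for _ in range(table["columnCount"])]
            for _ in range(table["rowCount"])]
    for cell in table["cells"]:
        # Assumption: the cell text lives under "content"
        # (fall back to "text" in case your API version uses that key).
        text = cell.get("content", cell.get("text", ""))
        grid[cell["rowIndex"]][cell["columnIndex"]] = text.strip()
    return grid


def checked_answers(grid):
    """Map each statement (first column) to the header of the column marked X."""
    header = grid[0]
    answers = {}
    for row in grid[1:]:
        marked = [header[i] for i in range(1, len(row)) if row[i].upper() == "X"]
        answers[row[0]] = marked[0] if marked else None
    return answers


# Hypothetical sample modeled on "Quadro 3"; the X positions are invented.
sample_table = {
    "rowCount": 3,
    "columnCount": 3,
    "cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Artigos da LPI"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Sim"},
        {"rowIndex": 0, "columnIndex": 2, "content": "Não"},
        {"rowIndex": 1, "columnIndex": 0,
         "content": "O relatório descritivo está de acordo com disposto no art. 24 da LPI"},
        {"rowIndex": 1, "columnIndex": 1, "content": "X"},
        {"rowIndex": 2, "columnIndex": 0,
         "content": "O quadro reivindicatório está de acordo com disposto no art. 25 da LPI"},
        {"rowIndex": 2, "columnIndex": 2, "content": "X"},
    ],
}

grid = table_to_grid(sample_table)
answers = checked_answers(grid)
for statement, answer in answers.items():
    print(f"{answer}: {statement}")
```

Because every cell keeps its row and column index, the X can be tied back to Sim or Não, which is exactly what the plain OCR text could not give us.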
If you want, you can download the JSON file here.
With these form recognition APIs, we can build an algorithm that reads documents in bulk and saves the structured information in a data lake, a database, or whatever format your data analysis needs.
If you liked the post and would like a full example of the algorithm in Python, let me know in the comments and I will be happy to share it with you.
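As a rough idea of the storage side, here is a minimal Python sketch that flattens the table cells of several analyzed documents into a single SQLite table. It is only an illustration under stated assumptions: each result is assumed to be a dict with a `tables` list whose `cells` carry `rowIndex`, `columnIndex`, and `content`; the document ids, the `cells` table schema, and the helper names `rows_from_result` and `store_results` are hypothetical. The call to the recognition API itself is left out, and the sample results are made up.

```python
import sqlite3


def rows_from_result(doc_id, result):
    """Flatten every table cell of one analysis result into database rows."""
    for table_no, table in enumerate(result.get("tables", [])):
        for cell in table["cells"]:
            yield (doc_id, table_no, cell["rowIndex"],
                   cell["columnIndex"], cell["content"])


def store_results(conn, results):
    """Save the cells of many analyzed documents in one SQL table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cells ("
        "doc_id TEXT, table_no INTEGER, row_no INTEGER, "
        "col_no INTEGER, content TEXT)"
    )
    for doc_id, result in results.items():
        conn.executemany(
            "INSERT INTO cells VALUES (?, ?, ?, ?, ?)",
            rows_from_result(doc_id, result),
        )
    conn.commit()


# Hypothetical results keyed by document id; in practice each value would be
# the parsed JSON returned by the form recognition API for one PDF.
results = {
    "report-001": {"tables": [{"cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Artigos da LPI"},
        {"rowIndex": 0, "columnIndex": 1, "content": "Sim"},
    ]}]},
    "report-002": {"tables": [{"cells": [
        {"rowIndex": 0, "columnIndex": 0, "content": "Artigos da LPI"},
    ]}]},
}

conn = sqlite3.connect(":memory:")  # swap for a file path or another database
store_results(conn, results)
counts = conn.execute(
    "SELECT doc_id, COUNT(*) FROM cells GROUP BY doc_id ORDER BY doc_id"
).fetchall()
print(counts)
```

Storing one row per cell keeps the schema independent of any particular table layout, so reports with different tables can share the same store and be reshaped later with SQL.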
That’s it for today!