Where is the most relevant information for data analysis in Law and Intellectual Property?

Strategic information for data-driven decision-making in law and intellectual property is most often stored in PDF documents. A few examples: which judge decided a lawsuit, what the reason for a rejection was, and, in the case of patents, which examiners signed a technical examination report or decision, what their reasoning was, and which articles were cited as the basis for rejecting the patent.

This information is usually stored in an unstructured way, and a simple OCR pass is often not enough. Today there are many AI-based APIs that can extract information in a structured way. Form-aware APIs are one example: these tools can, for instance, extract a table from a PDF document and keep it in table form. There are several solutions on the market; the ones I have had the opportunity to test are Google Document AI and Azure Form Recognizer.

Let’s take a look at the pros and cons of each option to help you decide.

Google Document AI Pros:

integrates with Google Drive, making it easy to use for businesses that already use Google products
offers a free tier with limited features for businesses on a budget
an easy-to-use interface makes it quick to get started with little training required

Google Document AI Cons:

lacks some of the more advanced features offered by competitors, making it less suitable for businesses with complex needs
not as widely used as some competitors, making it harder to find support and resources if you encounter problems
pricing can be expensive for businesses that need more than the free tier offers

Azure Form Recognizer Pros:

offers more advanced features than Google Document AI, making it better suited for businesses with complex needs
widely used, meaning there’s plenty of support and resources available if you encounter problems
pricing is based on usage, so you only pay for what you need

Azure Form Recognizer Cons:

not as easy to use as Google Document AI so it may require more training for employees
doesn’t integrate with other Microsoft products as seamlessly as Google Document AI integrates with Google products

I tested the Azure Form Recognizer API on a patent technical examination report downloaded from the Brazilian Patent and Trademark Office (BRPTO). These documents normally follow the format below. If you want to see the file in full, click here.

If we simply run OCR on these tables, the data looks like this:

“Quadro 2 – Considerações referentes aos Artigos 10, 18, 22 e 32 da Lei n.o 9.279 de 14 demaio de 1996 – LPI Artigos da LPISim NãoA matéria enquadra-se no art. 10 da LPI (não se considera invenção)XA matéria enquadra-se no art. 18 da LPI (não é patenteável)XO pedido apresenta Unidade de Invenção (art. 22 da LPI)XO pedido está de acordo com disposto no art. 32 da LPIXComentários/Justificativas”

“Quadro 3 – Considerações referentes aos Artigos 24 e 25 da LPIArtigos da LPISim NãoO relatório descritivo está de acordo com disposto no art. 24 da LPIXO quadro reivindicatório está de acordo com disposto no art. 25 da LPIX”

From this flattened text, we cannot efficiently or accurately identify which options were marked in the tables, because the column positions are lost. The best solution is therefore to use an API that recognizes tables, as shown below:

You can see that the table columns are recognized perfectly, and the data is extracted exactly as it appears in the table and converted to JSON format.
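For readers who want to try this, here is a minimal sketch of the extraction step in Python. It assumes the azure-ai-formrecognizer package (version 3.2 or later) and its prebuilt layout model; the function names extract_tables and cells_to_grid, and the endpoint and key values you pass in, are placeholders of my own and not part of the original test. The JSON it produces is a simplified grid, not the full response returned by the service.

```python
import json


def cells_to_grid(cells, row_count, column_count):
    """Arrange (row_index, column_index, content) triples into a 2-D list."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for row_index, column_index, content in cells:
        grid[row_index][column_index] = content.strip()
    return grid


def extract_tables(pdf_path, endpoint, key):
    """Send one PDF to the prebuilt layout model and return its tables as grids.

    Requires a real Azure endpoint and key, supplied by the caller.
    """
    # Imported here so cells_to_grid can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    with open(pdf_path, "rb") as pdf_file:
        poller = client.begin_analyze_document("prebuilt-layout", document=pdf_file)
        result = poller.result()  # blocks until the analysis has finished

    grids = []
    for table in result.tables:
        cells = [(c.row_index, c.column_index, c.content) for c in table.cells]
        grids.append(cells_to_grid(cells, table.row_count, table.column_count))
    return grids


def grids_to_json(grids):
    """Serialize the grids, keeping accented characters readable."""
    return json.dumps(grids, ensure_ascii=False, indent=2)
```

Calling grids_to_json(extract_tables("report.pdf", endpoint, key)) would then give one list of rows per table, with each marked cell staying in its own column instead of being glued to the neighbouring text.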

If you want, you can download the JSON file here.

With these form recognition APIs, we can build an algorithm that reads documents in bulk and saves the structured information in a data lake, a database, or whatever format your data analysis requires.
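As a rough sketch of the storage step, the snippet below saves one extracted table into a SQLite database using only the Python standard library. It assumes each table has already been turned into a list of rows shaped like [statement, Sim cell, Não cell] with a header row first, and that a marked option appears as an "X" in its cell; the table name exam_answers and both function names are hypothetical choices for illustration.

```python
import sqlite3


def marked_option(row):
    """Return 'Sim', 'Não', or None for a row shaped [statement, sim, nao]."""
    _, sim_cell, nao_cell = row
    if sim_cell.strip().upper() == "X":
        return "Sim"
    if nao_cell.strip().upper() == "X":
        return "Não"
    return None  # nothing marked, or the mark was not recognized


def save_grid(connection, document_id, grid):
    """Store one answer per table row and return the number of rows saved."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS exam_answers ("
        "document_id TEXT, statement TEXT, answer TEXT)"
    )
    # Skip the header row and any row that does not have exactly 3 cells.
    rows = [
        (document_id, row[0], marked_option(row))
        for row in grid[1:]
        if len(row) == 3
    ]
    connection.executemany("INSERT INTO exam_answers VALUES (?, ?, ?)", rows)
    connection.commit()
    return len(rows)
```

For a mass reading, you would loop over your PDF files, extract the tables from each one, and call save_grid once per table with a document identifier of your choice. The same rows could just as well be written to a data lake or another database.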

If you liked the post and would like a fuller example of the algorithm in Python, let me know in the comments below and I will be happy to share it with you.

That’s it for today!

Author: Lawrence Teixeira

With over 30 years of expertise in the Technology sector and 18 years in leadership roles as a CTO/CIO, he excels at spearheading the development and implementation of strategic technological initiatives, focusing on system projects, advanced data analysis, Business Intelligence (BI), and Artificial Intelligence (AI). Holding an MBA with a specialization in Strategic Management and AI, along with a degree in Information Systems, he demonstrates an exceptional ability to synchronize cutting-edge technologies with efficient business strategies, fostering innovation and enhancing organizational and operational efficiency. His experience in managing and implementing complex projects is vast, utilizing various methodologies and frameworks such as PMBOK, Agile Methodologies, Waterfall, Scrum, Kanban, DevOps, ITIL, CMMI, and ISO/IEC 27001, to lead data and technology projects. His leadership has consistently resulted in tangible improvements in organizational performance. At the core of his professional philosophy is the exploration of the intersection between data, technology, and business, aiming to unleash innovation and create substantial value by merging advanced data analysis, BI, and AI with a strategic business vision, which he believes is crucial for success and efficiency in any organization.