Where is the most relevant information for data analysis in Law and Intellectual Property?

Strategic information for data-driven decision-making in law and intellectual property is most often stored in PDF documents. Examples include which judge decided a lawsuit and the reason for a rejection; for patents, which examiners signed a technical examination report or decision, the reasons given, and which articles were cited as the basis for rejecting an application.


Information is usually stored in an unstructured way, and a simple OCR procedure is often not enough. Today, many AI-based APIs can extract this information in a structured way. Form-aware APIs are one example: they can extract a table from a PDF document as a table, preserving its rows and columns. There are several solutions on the market; the two I have had the opportunity to test are Google Document AI and Azure Form Recognizer.

Let’s take a look at the pros and cons of each option to help you decide.

Google Document AI Pros:

  • integrates with Google Drive, making it easy to use for businesses that already use Google products
  • offers a free tier with limited features for businesses on a budget
  • an easy-to-use interface makes it quick to get started with little training required

Google Document AI Cons:

  • lacks some of the more advanced features offered by competitors, making it less suitable for businesses with complex needs
  • not as widely used as some competitors, making it harder to find support and resources if you encounter problems
  • pricing can be expensive for businesses that need more than the free tier offers

Azure Form Recognizer Pros:

  • offers more advanced features than Google Document AI, making it better suited for businesses with complex needs
  • widely used, meaning there’s plenty of support and resources available if you encounter problems
  • pricing is based on usage, so you only pay for what you need

Azure Form Recognizer Cons:

  • not as easy to use as Google Document AI so it may require more training for employees
  • doesn’t integrate with other Microsoft products as seamlessly as Google Document AI integrates with Google products

I tested the Azure Form Recognizer API on a patent technical examination report downloaded from the Brazilian Patent and Trademark Office (BRPTO). Documents are normally in the format below. If you want to see the full file, click here.

If we simply perform an OCR on these tables, the data looks like this:

Quadro 2 – Considerações referentes aos Artigos 10, 18, 22 e 32 da Lei n.o 9.279 de 14 demaio de 1996 – LPI Artigos da LPISim NãoA matéria enquadra-se no art. 10 da LPI (não se considera invenção)XA matéria enquadra-se no art. 18 da LPI (não é patenteável)XO pedido apresenta Unidade de Invenção (art. 22 da LPI)XO pedido está de acordo com disposto no art. 32 da LPIXComentários/Justificativas

Quadro 3 – Considerações referentes aos Artigos 24 e 25 da LPIArtigos da LPISim NãoO relatório descritivo está de acordo com disposto no art. 24 da LPIXO quadro reivindicatório está de acordo com disposto no art. 25 da LPIX

With plain OCR, we cannot efficiently and accurately identify which options are marked in the tables. The better solution is to use an API that recognizes tables, as shown below:

Click on the image for a full screen

You can see that the columns in the tables are recognized perfectly, and we extracted the data exactly as it is in the table converted to JSON format.

If you want, you can download the JSON file here.

From these form recognition APIs, we can create an algorithm to perform a mass reading and save the structured information in a Data Lake, Database, or whatever format you need to use in your data analysis.
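
As a minimal sketch of that idea, the snippet below arranges the cells of one recognized table into rows and reads which option is marked. It assumes each cell is a dictionary with `rowIndex`, `columnIndex`, and `content` keys, which is how I understand the table output of these APIs to be shaped; the helper names `table_to_rows` and `marked_options` are hypothetical, and the sample cells (including where the "X" marks sit) are typed in by hand for illustration, not taken from a real API response.

```python
from collections import defaultdict


def table_to_rows(cells):
    """Arrange a flat list of cells into a list of rows ordered by column."""
    grid = defaultdict(dict)
    for cell in cells:
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell.get("content", "").strip()
    rows = []
    for row_index in sorted(grid):
        columns = grid[row_index]
        width = max(columns) + 1
        rows.append([columns.get(i, "") for i in range(width)])
    return rows


def marked_options(rows, mark="X"):
    """Map each statement (first column) to the header of the column holding the mark."""
    header = rows[0]
    result = {}
    for row in rows[1:]:
        for column, value in enumerate(row[1:], start=1):
            if value.upper() == mark and column < len(header):
                result[row[0]] = header[column]
    return result


# Hand-typed sample that mimics "Quadro 3" (assumed shape, invented marks)
sample_cells = [
    {"rowIndex": 0, "columnIndex": 0, "content": "Artigos da LPI"},
    {"rowIndex": 0, "columnIndex": 1, "content": "Sim"},
    {"rowIndex": 0, "columnIndex": 2, "content": "Não"},
    {"rowIndex": 1, "columnIndex": 0, "content": "O relatório descritivo está de acordo com disposto no art. 24 da LPI"},
    {"rowIndex": 1, "columnIndex": 1, "content": "X"},
    {"rowIndex": 1, "columnIndex": 2, "content": ""},
    {"rowIndex": 2, "columnIndex": 0, "content": "O quadro reivindicatório está de acordo com disposto no art. 25 da LPI"},
    {"rowIndex": 2, "columnIndex": 1, "content": ""},
    {"rowIndex": 2, "columnIndex": 2, "content": "X"},
]

rows = table_to_rows(sample_cells)
answers = marked_options(rows)
for statement, answer in answers.items():
    print(f"{answer}: {statement}")
```

Once each table is reduced to a dictionary like `answers`, the results for many documents can be appended to a list and written to a database or a file for analysis.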


If you liked the post and would like an example of the algorithm in Python, let me know in the comments below and I will be happy to share it.

That’s it for today!

How to use Python in Google Colab integrated directly with Power BI to analyze patent data

This blog post will show you how to load and transform patent data and connect Power BI with Google Colab. Google Colab is a free cloud service that allows you to run Jupyter notebooks in the cloud. Jupyter notebooks are a great way to share your code and data analysis with others. Power BI is a business intelligence tool that allows you to visualize your data and create reports. Connecting Power BI with Google Colab allows you to easily share your data visualizations with others. Let’s get started!

What is a patent?

A patent is an exclusive right granted for an invention, which is a product or a process that provides, in general, a new way of doing something or offers a new technical solution to a problem. To get a patent, technical information about the invention must be disclosed to the public in a patent application.

What is WIPO?

WIPO (the World Intellectual Property Organization) is the global forum for intellectual property (IP) services, policy, information, and cooperation. WIPO’s activities include hosting forums to discuss and shape international IP rules and policies, providing global services that register and protect IP in different countries, resolving transboundary IP disputes, helping connect IP systems through uniform standards and infrastructure, and serving as a general reference database on all IP matters; this includes providing reports and statistics on the state of IP protection or innovation both globally and in specific countries. WIPO also works with governments, nongovernmental organizations (NGOs), and individuals to utilize IP for socioeconomic development. If you need more information about WIPO, click here.

This video demonstrates the Power BI functionality we will use today

Now that you understand what a patent is and what WIPO is, let’s start our experiment!

First, we will load the patent data from WIPO. In this experiment, we will use the authority file from 2022.

Python
from powerbiclient import Report, models
from powerbiclient.authentication import DeviceCodeLoginAuthentication
import pandas as pd
from google.colab import drive
from google.colab import output
import zipfile
import requests

# Mount Google Drive
drive.mount('/content/gdrive')

file_url = "https://patentscope.wipo.int/search/static/authority/2022.zip"
zip_path = "/content/gdrive/My Drive/2022.zip"

# Download the zip file in chunks and stop early if the request failed
r = requests.get(file_url, stream=True)
r.raise_for_status()

with open(zip_path, "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)

# Read the CSV directly from the zip archive
with zipfile.ZipFile(zip_path) as compressed_file:
    with compressed_file.open('2022.csv') as csv_file:
        data = pd.read_csv(csv_file, delimiter=";", names=["Publication Number","Publication Date","Title","Kind Code","Application No","Classification","Applicant","Url"])

# Show the first rows
data.head()

Now that we have the data, let’s apply some transformations to prepare it for loading into the Power BI report.

Python
# Transformations of the CSV file downloaded from WIPO

# Remove the first two lines
data = data.iloc[2:]

#create a new column with the Classification name
data["Classification_Name"] = data["Classification"].str[:1]

#Modify this column with the classification description
data["Classification_Name"] = data["Classification_Name"].replace({
    'A': 'Human Necessities', 
    'B': 'Performing Operations and Transporting', 
    'C': 'Chemistry and Metallurgy', 
    'D': 'Textiles and Paper', 
    'E': 'Fixed Constructions', 
    'F': 'Mechanical Engineering', 
    'G': 'Physics', 
    'H': 'Electricity'
  }
)

#Show again the head data
data.head()

# Save the Excel file in Google Drive to share with the Power BI report
# (absolute path under the mount point; the "datasets" folder must already exist)
data.to_excel("/content/gdrive/MyDrive/datasets/Result_WIPO2022.xlsx")
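
One caveat with mapping the first letter this way: rows with a missing or unexpected classification are left as they are rather than flagged. As a hedged alternative, here is a small pure-Python helper that returns "Unknown" in those cases. The names `IPC_SECTIONS` and `ipc_section_name` are hypothetical, and I am assuming the classification codes follow IPC-style section letters (A to H), as the descriptions used in this post suggest.

```python
# Section letters and descriptions, as used in this post
IPC_SECTIONS = {
    "A": "Human Necessities",
    "B": "Performing Operations and Transporting",
    "C": "Chemistry and Metallurgy",
    "D": "Textiles and Paper",
    "E": "Fixed Constructions",
    "F": "Mechanical Engineering",
    "G": "Physics",
    "H": "Electricity",
}


def ipc_section_name(classification):
    """Return the section name for a classification code, or 'Unknown'."""
    # Missing values (None, NaN) and empty strings have no section letter
    if not isinstance(classification, str) or not classification.strip():
        return "Unknown"
    first_letter = classification.strip()[0].upper()
    return IPC_SECTIONS.get(first_letter, "Unknown")


print(ipc_section_name("G06F 16/00"))
print(ipc_section_name(None))
```

With the DataFrame from this post, it could be applied as `data["Classification_Name"] = data["Classification"].map(ipc_section_name)`.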

After that, we will connect to Power BI and show the report inside Google Colab.

Python
# Use the DeviceCodeLoginAuthentication class to authenticate against Power BI and start the Microsoft device authentication flow
device_auth = DeviceCodeLoginAuthentication()

group_id="YOU HAVE TO PUT HERE YOUR POWER BI GROUP ID OR WORKSPACE ID"
report_id="YOU HAVE TO PUT HERE YOUR POWER BI REPORT ID"

report = Report(group_id=group_id, report_id=report_id, auth=device_auth)
report.set_size(1024, 1600)
output.enable_custom_widget_manager()

# Show the power BI report with the wipo downloaded data.
report

Click here to see this report in full-screen mode.

The Google Colab file with the Python code is available here. If you want the Power BI report, click here.

Conclusion

In this blog post, we showed you how to load data from an external dataset, transform it, and display it in a Power BI report inside Google Colab. By following these steps, you can start using Google Colab and Power BI to analyze your data with Python and easily share it with others!

That’s it for today!

Twitter Sentiment Analysis using OpenAI and Power BI

This article is an experiment that explains how to use OpenAI to predict the sentiment of recent tweets on a specific topic and the gender suggested by each author's name, and how to show the result in a Power BI dashboard.

What is OpenAI?

OpenAI is an AI research company that offers an API to large language models such as GPT-3. The model used in this experiment, text-davinci-002, is a general-purpose language model trained on a large corpus of text, not a dedicated tweet-sentiment classifier. It classifies sentiment by following the instruction given in the prompt, so its answers should be treated as predictions and spot-checked rather than taken as exact.

How does it work?

You input some text as a prompt, and the API will return a text completion that attempts to match whatever instructions or context you gave it.

You can think of this as a very advanced autocomplete — the model processes your text prompt and tries to predict what’s most likely to come next.

This video explains in more detail how OpenAI works

In our case, we use the expression, “Decide whether a Tweet’s sentiment is positive, neutral, or negative. Tweet:“, to extract the sentiment, and, “Extract the gender and decide whether a name´s gender is male, female, or unknown. Name:“, to extract the gender from the user name.
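
To make the prompt format concrete, here is a minimal sketch of how such a prompt can be assembled. The helper name `build_prompt` and its `answer_label` parameter are hypothetical, and the sample tweet is invented; the point is only that the instruction, the input text, and the optional answer label are separated by real newline characters (`"\n\n"`).

```python
def build_prompt(instruction, text, answer_label=""):
    """Join the instruction and the input text, optionally ending with an answer label."""
    prompt = f"{instruction}:\n\n{str(text).strip()}"
    if answer_label:
        # Ending with "Label:" nudges the model to complete with just the answer
        prompt += f"\n\n{answer_label}:"
    return prompt


sentiment_prompt = build_prompt(
    "Decide whether a Tweet's sentiment is positive, neutral, or negative. Tweet",
    "I loved the new Batman movie!",  # invented sample tweet
    answer_label="Sentiment",
)
print(sentiment_prompt)
```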

How does the experiment work?

The Python script gets the recent tweets about a topic and analyzes the sentiment and the gender of the text of each tweet. After that, the result is saved in an Excel file. I don’t recommend it because it can get slow, but it’s possible to run the Python code directly from Power BI. Follow the instructions here.

Before executing the Python script, you must create an account on the Twitter developer portal and on OpenAI to obtain the “BEARER_TOKEN” and the “OPEN AI KEY”, respectively.

The Python code is below:

Python
# Twitter sentiment analysis using Open AI and Power BI
# Author: Lawrence Teixeira
# Date: 2022-10-09

# Requirements
# pip install tweepy==4.0
# pip install "openai<1.0"   (openai.Completion, used below, was removed in openai 1.0)

# Import the packages
import pandas as pd
import tweepy
import openai

# Connect to Twitter API
MY_BEARER_TOKEN = "YOU HAVE TO INSERT HERE YOUR TWITTER BEARER TOKEN"

# create your client
client = tweepy.Client(bearer_token=MY_BEARER_TOKEN)

# Functions to extract sentiment and gender with the OpenAI API
# If you want to see more examples of how to use OpenAI, click [here](https://beta.openai.com/examples/).

openai.api_key = "YOU HAVE TO INSERT HERE YOUR OPEN AI KEY"

def Generate_OpenAI_Sentiment(question_type, openai_response):
    # Separate the instruction, the tweet, and the answer label with real newlines ("\n\n")
    response = openai.Completion.create(
      engine="text-davinci-002",
      prompt=question_type + ":\n\n" + str(openai_response) + "\n\nSentiment:",
      temperature=0.7,
      max_tokens=100,
      top_p=1,
      frequency_penalty=0.5,
      presence_penalty=0
    )
    # Strip the surrounding whitespace the completion usually carries
    return response['choices'][0]['text'].strip()

def Generate_OpenAI_Gender(question_type, openai_response):
    # Same prompt layout, without an answer label
    response = openai.Completion.create(
      engine="text-davinci-002",
      prompt=question_type + ":\n\n" + str(openai_response),
      temperature=0.7,
      max_tokens=100,
      top_p=1,
      frequency_penalty=0.5,
      presence_penalty=0
    )
    return response['choices'][0]['text'].strip()

# Query used to search for tweets. You can put whatever you want here.
# If you want to know more about the Twitter query parameters, click [here](https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent/).
query = "#UkraineWarNews lang:en"

# Optionally set the start and end time for fetching tweets
#start_time = "2022-10-07T00:00:00Z"
#end_time   = "2022-10-08T00:00:00Z"

# get tweets from the API
tweets = client.search_recent_tweets(query=query,
                                    #start_time=start_time,
                                    #end_time=end_time,
                                     tweet_fields = ["created_at", "text", "source"],
                                     user_fields = ["name", "username", "location", "verified", "description"],
                                     max_results = 100,
                                     expansions='author_id'
                                     )

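# Create a list to collect the results
tweet_info_ls = []
# includes['users'] lists each author only once, so it does not line up with tweets.data;
# look each author up by id instead of zipping the two lists
users_by_id = {user.id: user for user in tweets.includes['users']}
# iterate over each tweet and its author's details
for tweet in tweets.data:
    user = users_by_id[tweet.author_id]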
    tweet_info = {
        'created_at': tweet.created_at,
        'text': tweet.text,
        'source': tweet.source,
        'name': user.name,
        'username': user.username,
        'location': user.location,
        'verified': user.verified,
        'description': user.description,
        'Sentiment': Generate_OpenAI_Sentiment("Decide whether a Tweet's sentiment is positive, neutral, or negative. Tweet", tweet.text ),
        'Gender': Generate_OpenAI_Gender("Extract the gender and decide whether a name´s gender is male, female, or unknown. Name", user.name ),
        'Query': query.rsplit(' ', 2)[0]
    }
    tweet_info_ls.append(tweet_info)
# create dataframe from the extracted records
tweets_df = pd.DataFrame(tweet_info_ls)

# remove the timezone format
tweets_df['created_at'] = tweets_df['created_at'].dt.tz_localize(None)

# If you use Google Colab, save the result as an Excel file in Google Drive
#tweets_df.to_excel("drive/MyDrive/datasets/Resulados_twitter.xlsx")

# If you want to load the result directly into Power BI
print(tweets_df)

Once you execute this Python code and refresh the Power BI report, you will see the analysis result. In my case, I chose UkraineWarNews. It’s interesting to see in the Power BI dashboard that 78% of the tweets are negative and 16% are positive, while 33% of the authors are identified as male versus 5% as female. You can interact with this report by clicking on the visuals.
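
Percentages like these depend on how the raw model answers are grouped. The model returns free text, so answers such as "Negative", " negative." and "negative" need to be normalized before counting, and anything unexpected ends up in a separate bucket, which is one possible reason the shares of the expected labels do not add up to 100%. The sketch below shows one way to do that with the standard library; the helper names `normalize_label` and `label_shares` are hypothetical and the sample answers are invented.

```python
from collections import Counter


def normalize_label(raw, allowed):
    """Lower-case and trim a model answer; anything unexpected becomes 'other'."""
    label = str(raw).strip().strip(".").lower()
    return label if label in allowed else "other"


def label_shares(raw_labels, allowed):
    """Return the percentage of each normalized label, rounded to one decimal."""
    counts = Counter(normalize_label(raw, allowed) for raw in raw_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * count / total, 1) for label, count in counts.items()}


# Invented sample of raw answers, including stray whitespace and an off-list reply
sample = ["\n\nNegative", " negative.", "Positive", "Neutral", "not sure"]
shares = label_shares(sample, {"positive", "neutral", "negative"})
print(shares)
```

The same functions can be applied to the `Gender` column with the allowed set `{"male", "female", "unknown"}`.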

Click here to see this report in full-screen mode.

Important: This experiment analyzes only the last 100 tweets, and gender is inferred only from the spelling of the name, not from how each individual actually identifies.

You can download the Power BI report here, and the version of the Python code in Google Colab here.

There are a lot of possibilities for using this solution in the real world. OpenAI has many other examples, such as keyword extraction, text summarization, grammar correction, a restaurant review creator, and much more. You can access all the examples here. If you have questions about the solution, feel free to comment in the box below.

That’s it for today.

Power BI Licensing Explained

Power BI is a data visualization and business intelligence tool from Microsoft. It allows users to connect to, visualize, and analyze data with greater speed, efficiency, and understanding. Power BI Desktop is free, but to share content and collaborate with others you need a paid license. But what kind of license should you get? Read on to find out the different types of Power BI licensing and which one is right for you.

Power BI Desktop

Power BI Desktop is a free application that can be downloaded from the Microsoft website. It can be used by individuals or groups working together who want to create reports and visualizations based on their data. This version of Power BI is best for small businesses or teams who want to get started with data visualization and don’t need advanced features or collaboration tools.

Power BI Pro

Power BI Pro is a paid subscription that gives users access to additional features not available in Power BI Desktop. These features include sharing and collaboration tools, support for larger data sets, and more advanced data manipulation and visualization capabilities. Power BI Pro is best for small to medium businesses that need more than just the basics from their data visualization tool.

Power BI Premium

Power BI Premium is a scalable subscription plan that is designed for enterprise-level businesses. It provides all the features of Power BI Pro, plus the ability to host reports and visualizations on your dedicated server infrastructure. This makes it ideal for large businesses with complex data analysis requirements.

Premium Per User (PPU)

Premium Per User (PPU) is a new way to license premium features on a per-user basis and includes all Power BI Pro license capabilities, along with features like paginated reports, AI, and other capabilities that previously were only available with a Premium capacity. With a PPU license, you do not need a separate Power BI Pro license, as all Pro license capabilities are included in PPU.

Users must have a Premium Per User (PPU) license to access content in a Premium Per User (PPU) workspace or app. This requirement includes scenarios where users access the content through the XMLA endpoint, Analyze in Excel, Composite Models, and so on. You can grant access to users to the workspace who don’t have a PPU license, but they will receive a message stating they cannot access the content. They’ll then be prompted to get a trial license if they are eligible. If they aren’t eligible, they must be assigned a license by their Admin to gain access to the resource.

The following table describes who can see which kinds of content with PPU.

Premium Per User (PPU) works with Power BI embedded similarly to a Power BI Pro license. You can embed the content, and each user will need a PPU license to view it.

This video explains the types of Power BI licenses

Conclusion:

What type of Power BI license is right for you? If you’re an individual or small team just getting started with data visualization, Power BI Desktop will probably suffice. If you need more advanced features and collaboration tools, then you

For details, see the Power BI pricing page: https://powerbi.microsoft.com/en-us/pricing/