Asking questions via chat to the BRPTO’s Basic Manual for Patent Protection PDF, using LangChain, Pinecone, and OpenAI

Have you ever wanted to search through your PDF files and find the most relevant information quickly and easily? If you have a lot of PDF documents, such as books, articles, reports, or manuals, you might find it hard to locate the information you need without opening each file and scanning through the pages. Wouldn’t it be nice if you could type in a query and get the best matches from your PDF collection?

In this blog post, I will show you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. By combining these tools, we can create a system that can:

Extract text and metadata from PDF files.
Embed the text into vector representations using a language model.
Index and query the vectors using a vector database.
Generate natural language responses from the retrieved text using an OpenAI language model (the “text-embedding-ada-002” model is used only to create the embeddings).

What is LangChain?

LangChain is a framework for developing applications powered by language models. It provides modular abstractions for the components necessary to work with language models, such as data loaders, prompters, generators, and evaluators. It also has collections of implementations for these components and use-case-specific chains that assemble these components in particular ways to accomplish a specific task.

Prompts: This part allows you to create adaptable instructions using templates. It can adjust to different large language models (LLMs) based on the size of the context window and input factors like conversation history, search results, previous answers, and more.

Models: This part serves as a bridge to connect with most third-party LLMs. It offers integrations with roughly 40 public LLMs, chat models, and text embedding models.

Memory: This allows LLMs to remember the conversation history.

Indexes: Indexes are methods to arrange documents so that LLMs can interact with them effectively. This part includes helpful functions for dealing with documents and connections to different database systems for storing vectors (numeric representations of text).

Agents: Some applications don’t just need a set sequence of calls to LLMs or other tools, but possibly an unpredictable sequence based on the user’s input. In these sequences, there’s an agent that has access to a collection of tools. Depending on the user’s input, the agent can decide which tool – if any – to use.

Chains: Using an LLM on its own is fine for some simple applications, but more complex ones need to link multiple LLMs, either with each other or with other experts. LangChain offers a standard interface for these chains, as well as some common chain setups for easy use.
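
To make the roles of prompts, memory, and chains more concrete, here is a minimal sketch in plain Python. It does not use LangChain at all: `PROMPT`, `fake_llm`, and `SimpleChain` are hypothetical names invented for this illustration, the bracketed context strings are placeholders, and the "model" is a stand-in that only reports the size of the prompt it received.

```python
from string import Template

# Prompt: a reusable template with named slots (hypothetical wording).
PROMPT = Template(
    "Conversation so far:\n$history\n\n"
    "Context:\n$context\n\n"
    "Question: $question\nAnswer:"
)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only reports the prompt size."""
    return f"(model reply to a {len(prompt)}-character prompt)"


class SimpleChain:
    """Chain: fill the prompt, call the model, store the turn in memory."""

    def __init__(self, llm):
        self.llm = llm
        self.memory = []  # Memory: list of (question, answer) turns

    def run(self, question: str, context: str) -> str:
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.memory) or "(empty)"
        prompt = PROMPT.substitute(history=history, context=context, question=question)
        answer = self.llm(prompt)
        self.memory.append((question, answer))
        return answer


chain = SimpleChain(fake_llm)
first = chain.run("What is a patent?", "[retrieved text about what a patent is]")
second = chain.run("Who grants it?", "[retrieved text about the granting office]")
print(first)
print(second)
```

The second call sees the first question and answer in its prompt, which is the essence of memory. LangChain’s own classes cover the same ground with many more options.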

With LangChain, you can build applications that can:

Connect a language model to other sources of data, such as documents, databases, or APIs
Allow a language model to interact with its environments, such as chatbots, agents, or generators
Evaluate and improve the quality of a language model’s output using feedback

Some examples of applications that you can build with LangChain are:

Question answering over specific documents
Chatbots that can access external knowledge or services
Agents that can perform tasks or solve problems using language models
Generators that can create content or code using language models

You can learn more about LangChain from their documentation or their GitHub repository. You can also find tutorials and demos in different languages, such as Chinese, Japanese, or English.

What is Pinecone?

Pinecone is a vector database for vector search. It makes it easy to build high-performance vector search applications by managing and searching through vector embeddings in a scalable and efficient way. Vector embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, you can embed text into vectors using a language model, such that similar texts have similar vectors.
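
As a small illustration of “similar texts have similar vectors”, the sketch below computes cosine similarity, a common way to compare embeddings. The three-dimensional vectors are made up for the example; real embeddings from “text-embedding-ada-002” have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" standing in for three texts.
patent_filing = [0.9, 0.1, 0.0]
patent_application = [0.8, 0.2, 0.1]
cooking_recipe = [0.0, 0.2, 0.9]

print(cosine_similarity(patent_filing, patent_application))  # close to 1
print(cosine_similarity(patent_filing, cooking_recipe))      # close to 0
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are unrelated.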

With Pinecone, you can create indexes that store your vector embeddings and metadata, such as document titles or authors. You can then query these indexes using vectors or keywords, and get the most relevant results in milliseconds. Pinecone also handles all the infrastructure and algorithmic complexities behind the scenes, ensuring you get the best performance and results without any hassle.
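
The idea of an index that stores vectors with metadata and returns the closest matches can be sketched in a few lines of plain Python. This is a toy, not Pinecone’s API: `TinyVectorIndex` and its methods are hypothetical names, the vectors and titles are invented, and the brute-force scan is only practical for very small collections.

```python
import math


class TinyVectorIndex:
    """Toy in-memory index: stores (id, vector, metadata), ranks by cosine similarity."""

    def __init__(self):
        self.items = {}

    def upsert(self, item_id, vector, metadata=None):
        # Insert a new record or overwrite an existing one with the same id.
        self.items[item_id] = (vector, metadata or {})

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    def query(self, vector, top_k=2):
        # Brute-force scan: score every stored vector, highest similarity first.
        scored = [
            (self._cosine(vector, stored), item_id, metadata)
            for item_id, (stored, metadata) in self.items.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


index = TinyVectorIndex()
index.upsert("chunk-1", [0.9, 0.1, 0.0], {"title": "Filing a patent"})
index.upsert("chunk-2", [0.1, 0.9, 0.1], {"title": "Annual fees"})
index.upsert("chunk-3", [0.0, 0.2, 0.9], {"title": "Glossary"})

results = index.query([0.8, 0.2, 0.1], top_k=2)
for score, item_id, metadata in results:
    print(f"{item_id}  {metadata['title']}  score={score:.3f}")
```

The metadata travels with each vector, so a match can be shown to the user with its title or source page rather than as a bare list of numbers.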

Some examples of applications that you can build with Pinecone are:

Semantic search: Find documents or products that match the user’s intent or query
Recommendations: Suggest items or content that are similar or complementary to the user’s preferences or behavior
Anomaly detection: Identify outliers or suspicious patterns in data
Generation: Create new content or code that is similar or related to the input

You can learn more about Pinecone from their website or their blog, where you can also find pricing details and sign up for a free account.

Presenting the Python code and explaining its functionality

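This code is divided into two parts:

Preparing the PDF document for querying: loading the file, splitting it into chunks, embedding the chunks, and storing them in Pinecone.

Executing queries on the PDF: retrieving the chunks most similar to the question and passing them to the LLM to produce an answer.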
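
A key step of the preparation stage is splitting the document text into chunks. The script in this post uses LangChain’s RecursiveCharacterTextSplitter with a chunk size of 2000 characters and no overlap. The sketch below is a simplified stand-in, assuming fixed character offsets only: `split_into_chunks` is a hypothetical helper, and unlike the real splitter it makes no attempt to break on paragraph or line boundaries.

```python
def split_into_chunks(text, chunk_size=2000, chunk_overlap=0):
    """Cut text into pieces of at most chunk_size characters.

    Consecutive pieces share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


sample = "abcdefghij" * 5  # 50 characters of placeholder text
chunks = split_into_chunks(sample, chunk_size=20, chunk_overlap=5)
print([len(c) for c in chunks])
print(chunks[0][-5:] == chunks[1][:5])  # neighbours share 5 characters
```

A non-zero overlap repeats a little text at each boundary, which can help when a sentence that answers the question would otherwise be cut in half.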

Below is the Python script that I’ve developed, which can also be executed in Google Colab.

PowerShell

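# Install the dependencies
# The script below was written against the mid-2023 releases of these
# libraries (for example, it calls pinecone.init()); newer releases may require changes
pip install langchain
pip install openai
pip install pinecone-client
pip install tiktoken
pip install pypdf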

Python

# Provide your OpenAI API key and define the embedding model
OPENAI_API_KEY = "INSERT HERE YOUR OPENAI API KEY"
embed_model = "text-embedding-ada-002"

# Provide your Pinecone API key and specify the environment
PINECONE_API_KEY = "INSERT HERE YOUR PINECONE API KEY"
PINECONE_ENV = "INSERT HERE YOUR PINECONE ENVIRONMENT"

# Import the required modules
import openai, langchain, pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader

# Define a text splitter so the retrieved chunks fit within the LLM's token limit (about 4,000 tokens)
text_splitter = RecursiveCharacterTextSplitter(
    # We set a small chunk size for demonstration;
    # with length_function=len the size is measured in characters, not tokens
    chunk_size=2000,
    chunk_overlap=0,
    length_function=len,
)

# Initialize Pinecone with your API key and environment
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV,
)

# Define the index name for Pinecone
index_name = 'pine-search'

# Create an OpenAI embedding object with your API key
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Set up an OpenAI LLM model
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# Define a PDF loader and load the file (one Document per page)
loader = PyPDFLoader("https://lawrence.eti.br/wp-content/uploads/2023/07/ManualdePatentes20210706.pdf")
file_content = loader.load()

# Use the text splitter to split the loaded file content into manageable chunks
book_texts = text_splitter.split_documents(file_content)

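# Create the index in Pinecone if it does not exist yet
# (text-embedding-ada-002 produces 1536-dimensional vectors)
if index_name not in pinecone.list_indexes():
    print("Index does not exist, creating it: ", index_name)
    pinecone.create_index(index_name, dimension=1536, metric="cosine")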

# Create a Pinecone vector search object from the text chunks
book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name)

# Define your query
query = "Como eu faço para depositar uma patente no Brasil?"

# Use the Pinecone vector search to find documents similar to the query
docs = book_docsearch.similarity_search(query)

# Set up a QA chain with the LLM model and the selected chain type
chain = load_qa_chain(llm, chain_type="stuff")

# Run the QA chain with the found documents and your query to get the answer
answer = chain.run(input_documents=docs, question=query)
print(answer)
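
The script above loads the QA chain with chain_type="stuff". Conceptually, that strategy places all retrieved chunks into a single prompt together with the question. The sketch below illustrates the idea in plain Python: `build_stuff_prompt` is a hypothetical helper, the instruction wording is an assumption rather than LangChain’s exact template, and the bracketed chunk texts are placeholders.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following pieces of context to answer the question at the end. "
        "If you don't know the answer, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )


# Placeholder chunks standing in for the results of similarity_search
retrieved = ["[text of chunk 1]", "[text of chunk 2]"]
question = "Como eu faço para depositar uma patente no Brasil?"

prompt = build_stuff_prompt(retrieved, question)
print(prompt)
```

Because every chunk ends up in the same prompt, the chunk size and the number of retrieved chunks together determine whether the prompt still fits within the model’s token limit.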

Below is the application I developed for real-time evaluation of the PDF Search Engine

You can try the web application that I’ve designed to test the PDF search engine in real time. It lets you ask questions about the content of the BRPTO’s Basic Manual for Patent Protection.

Conclusion

In this blog post, I have shown you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. This system can help you find the most relevant information in your PDF files quickly and easily. You can also extend this system to handle other types of documents, such as images, audio, or video, by using different data loaders and language models.

I hope you enjoyed this tutorial and learned something new. If you have any questions or feedback, please feel free to leave a comment below or contact me. Thank you for reading!

That’s it for today!

Sources:

GoodAITechnology/LangChain-Tutorials (github.com)

INPI – Instituto Nacional da Propriedade Industrial — Instituto Nacional da Propriedade Industrial (www.gov.br)

Author: Lawrence Teixeira

With over 30 years of expertise in the Technology sector and 18 years in leadership roles as a CTO/CIO, he excels at spearheading the development and implementation of strategic technological initiatives, focusing on system projects, advanced data analysis, Business Intelligence (BI), and Artificial Intelligence (AI). Holding an MBA with a specialization in Strategic Management and AI, along with a degree in Information Systems, he demonstrates an exceptional ability to synchronize cutting-edge technologies with efficient business strategies, fostering innovation and enhancing organizational and operational efficiency. His experience in managing and implementing complex projects is vast, utilizing various methodologies and frameworks such as PMBOK, Agile Methodologies, Waterfall, Scrum, Kanban, DevOps, ITIL, CMMI, and ISO/IEC 27001, to lead data and technology projects. His leadership has consistently resulted in tangible improvements in organizational performance. At the core of his professional philosophy is the exploration of the intersection between data, technology, and business, aiming to unleash innovation and create substantial value by merging advanced data analysis, BI, and AI with a strategic business vision, which he believes is crucial for success and efficiency in any organization.