Have you ever wanted to search through your PDF files and find the most relevant information quickly and easily? If you have a lot of PDF documents, such as books, articles, reports, or manuals, you might find it hard to locate the information you need without opening each file and scanning through the pages. Wouldn’t it be nice if you could type in a query and get the best matches from your PDF collection?
In this blog post, I will show you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. By combining these tools, we can create a system that can:
- Extract text and metadata from PDF files.
- Embed the text into vector representations using OpenAI’s “text-embedding-ada-002” model.
- Index and query the vectors using a vector database.
- Generate natural language answers from the retrieved passages using an OpenAI LLM.
What is LangChain?
LangChain is a framework for developing applications powered by language models. It provides modular abstractions for the components needed to work with language models — prompts, models, memory, indexes, agents, and chains — along with collections of implementations for these components and use-case-specific chains that assemble them in particular ways to accomplish a specific task.
Prompts: This module lets you build flexible prompts using templates. They can adapt to different language models depending on the context window size and on inputs such as conversation history, search results, and previous answers.
Models: This module serves as a bridge to most third-party language models. It has integrations with roughly 40 public LLMs, chat models, and text embedding models.
Memory: This allows language models to remember the conversation history.
Indexes: Indexes are ways of structuring documents so that language models can interact with them efficiently. This module includes utility functions for working with documents and integrations with vector databases for storing embeddings (numeric representations of text).
Agents: Some applications need not just a predetermined sequence of calls to language models or other tools, but a sequence that depends on the user’s input. In these cases, an agent has access to a collection of tools and decides, based on the user’s input, which tool — if any — to use.
Chains: Using a language model on its own is fine for simple applications, but more complex ones need to link multiple language models together, or combine them with other components. LangChain offers a standard interface for these chains, as well as ready-made chain configurations for common use cases.
With LangChain, you can build applications that can:
- Connect a language model to other sources of data, such as documents, databases, or APIs
- Allow a language model to interact with its environment, as in chatbots or agents
- Optimize the performance and quality of a language model using feedback and reinforcement learning
Some examples of applications that you can build with LangChain are:
- Question answering over specific documents
- Chatbots that can access external knowledge or services
- Agents that can perform tasks or solve problems using language models
- Generators that can create content or code using language models
What is Pinecone?
Pinecone is a vector database for vector search. It makes it easy to build high-performance vector search applications by managing and searching through vector embeddings in a scalable and efficient way. Vector embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, you can embed text into vectors using a language model, such that similar texts have similar vectors.
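The “similar texts have similar vectors” idea can be made concrete with cosine similarity, the metric most vector searches use. Below is a toy illustration in plain Python — the three-dimensional vectors are invented for the example; real embeddings such as ada-002’s have 1536 dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": semantically close texts get nearby vectors
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
car = [0.0, 0.2, 0.9]

# "cat" is far more similar to "kitten" than to "car"
sim_kitten = cosine_similarity(cat, kitten)
sim_car = cosine_similarity(cat, car)
print(sim_kitten, sim_car)
```

A vector database like Pinecone performs essentially this comparison, but across millions of stored vectors, using indexes that avoid comparing against every one.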
With Pinecone, you can create indexes that store your vector embeddings and metadata, such as document titles or authors. You can then query these indexes using vectors or keywords, and get the most relevant results in milliseconds. Pinecone also handles all the infrastructure and algorithmic complexities behind the scenes, ensuring you get the best performance and results without any hassle.
Some examples of applications that you can build with Pinecone are:
- Semantic search: Find documents or products that match the user’s intent or query
- Recommendations: Suggest items or content that are similar or complementary to the user’s preferences or behavior
- Anomaly detection: Identify outliers or suspicious patterns in data
- Generation: Create new content or code that is similar or related to the input
Presenting the Python code and explaining its functionality
The code is divided into two parts: installing the dependencies and the main script. Below is the Python script that I’ve developed, which can also be executed in Google Colab at this link.
```shell
# Install the dependencies
pip install langchain
pip install openai
pip install pinecone-client
pip install tiktoken
pip install pypdf
```
```python
# Provide your OpenAI API key and define the embedding model
OPENAI_API_KEY = "INSERT HERE YOUR OPENAI API KEY"
embed_model = "text-embedding-ada-002"

# Provide your Pinecone API key and specify the environment
PINECONE_API_KEY = "INSERT HERE YOUR PINECONE API KEY"
PINECONE_ENV = "INSERT HERE YOUR PINECONE ENVIRONMENT"

# Import the required modules
import openai, langchain, pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader, PyPDFLoader

# Define a text splitter to keep chunks well within OpenAI's token limits
text_splitter = RecursiveCharacterTextSplitter(
    # We set a small chunk size for demonstration
    chunk_size=2000,
    chunk_overlap=0,
    length_function=len,
)

# Initialize Pinecone with your API key and environment
pinecone.init(api_key=PINECONE_API_KEY, environment=PINECONE_ENV)

# Define the index name for Pinecone
index_name = 'pine-search'

# Create an OpenAI embedding object with your API key
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)

# Set up an OpenAI LLM model
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)

# Define a PDF loader and load the file
loader = PyPDFLoader("https://lawrence.eti.br/wp-content/uploads/2023/07/ManualdePatentes20210706.pdf")
file_content = loader.load()

# Use the text splitter to split the loaded file content into manageable chunks
book_texts = text_splitter.split_documents(file_content)

# Create the index in Pinecone if it does not exist yet
# (1536 is the dimension of the "text-embedding-ada-002" embeddings)
if index_name not in pinecone.list_indexes():
    pinecone.create_index(index_name, dimension=1536, metric="cosine")

# Create a Pinecone vector search object from the text chunks
book_docsearch = Pinecone.from_texts(
    [t.page_content for t in book_texts], embeddings, index_name=index_name
)

# Define your query
query = "Como eu faço para depositar uma patente no Brasil?"

# Use the Pinecone vector search to find documents similar to the query
docs = book_docsearch.similarity_search(query)

# Set up a QA chain with the LLM model and the selected chain type
chain = load_qa_chain(llm, chain_type="stuff")

# Run the QA chain with the found documents and your query to get the answer
chain.run(input_documents=docs, question=query)
```
Below is the application I developed for real-time evaluation of the PDF Search Engine
You can try the web application that I’ve designed, which lets you test the PDF search engine in real time by asking questions about the contents of the BRPTO’s Basic Manual for Patent Protection. Click here to launch the application.
In this blog post, I have shown you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. This system can help you find the most relevant information in your PDF files quickly and easily. You can also extend it to handle other types of documents, such as images, audio, or video, by using different data loaders and language models.
I hope you enjoyed this tutorial and learned something new. If you have any questions or feedback, please feel free to leave a comment below or contact me here. Thank you for reading!
That’s it for today!