Have you ever wanted to search through your PDF files and find the most relevant information quickly and easily? If you have a lot of PDF documents, such as books, articles, reports, or manuals, you might find it hard to locate the information you need without opening each file and scanning through the pages. Wouldn’t it be nice if you could type in a query and get the best matches from your PDF collection?
In this blog post, I will show you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. By combining these tools, we can create a system that can:
- Extract text and metadata from PDF files.
- Embed the text into vector representations using a language model.
- Index and query the vectors using a vector database.
- Generate natural language responses with an OpenAI language model, grounded in the chunks retrieved through “text-embedding-ada-002” embeddings.
What is LangChain?
LangChain is a framework for developing applications powered by language models. It provides modular abstractions for the components necessary to work with language models, such as data loaders, prompters, generators, and evaluators. It also has collections of implementations for these components and use-case-specific chains that assemble these components in particular ways to accomplish a specific task.

LangChain is organized around six main components:
- Prompts: This component lets you build adaptable instructions from templates. It can adjust to different large language models based on the size of the context window and on inputs such as conversation history, search results, and previous answers.
- Models: This component is the bridge to most third-party large language models. It has connections to roughly 40 public large language models, chat models, and text embedding models.
- Memory: This component lets large language models remember the conversation history.
- Indexes: Indexes are ways to arrange documents so that large language models can interact with them effectively. This component includes helper functions for working with documents and connections to different vector databases, which store vectors (numeric representations of text).
- Agents: Some applications need more than a fixed sequence of calls to large language models or other tools; the sequence may be unpredictable and depend on the user’s input. In these cases, an agent has access to a collection of tools and decides, based on the user’s input, which tool (if any) to use.
- Chains: Using a large language model on its own is fine for simple applications, but more complex ones need to link several models, either with each other or with other components. LangChain offers a standard interface for these chains, as well as some common chain setups for easy use.
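Among the document helpers grouped under Indexes are text splitters, which cut a long document into chunks small enough to embed. The idea can be sketched in plain Python as a sliding window over characters. This is a simplified illustration, not LangChain's implementation: the function name `split_text` is hypothetical, and LangChain's RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraphs and sentences.

```python
def split_text(text, chunk_size, chunk_overlap=0):
    """Split text into character windows of at most chunk_size,
    where consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(split_text("abcdefghij", chunk_size=4, chunk_overlap=1))
print(split_text("abcdefghij", chunk_size=4))
```

A non-zero overlap repeats a little text at each boundary, which can help when a relevant sentence would otherwise be cut in half between two chunks.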
With LangChain, you can build applications that can:
- Connect a language model to other sources of data, such as documents, databases, or APIs
- Allow a language model to interact with its environment, for example through chatbots, agents, or generators
- Optimize the performance and quality of a language model using feedback and reinforcement learning
Some examples of applications that you can build with LangChain are:
- Question answering over specific documents
- Chatbots that can access external knowledge or services
- Agents that can perform tasks or solve problems using language models
- Generators that can create content or code using language models
You can learn more about LangChain from their documentation or their GitHub repository. You can also find tutorials and demos in different languages, such as Chinese, Japanese, or English.
What is Pinecone?

Pinecone is a vector database for vector search. It makes it easy to build high-performance vector search applications by managing and searching through vector embeddings in a scalable and efficient way. Vector embeddings are numerical representations of data that capture their semantic meaning and similarity. For example, you can embed text into vectors using a language model, such that similar texts have similar vectors.
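To make "similar texts have similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors and their labels are made up for illustration; real "text-embedding-ada-002" embeddings have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy "embeddings" (assumption: not produced by any real model)
patent_filing = [0.9, 0.1, 0.0]
patent_application = [0.8, 0.2, 0.1]
cake_recipe = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(patent_filing, patent_application)
sim_unrelated = cosine_similarity(patent_filing, cake_recipe)

# The two patent-related vectors point in nearly the same direction
print(sim_related > sim_unrelated)
```

A score close to 1 means the vectors point in almost the same direction, while a score close to 0 means they are nearly unrelated.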
With Pinecone, you can create indexes that store your vector embeddings together with metadata, such as document titles or authors. You can then query an index with a vector, optionally filtering on metadata, and get the most similar results in milliseconds. Pinecone also handles the infrastructure and algorithmic complexity behind the scenes, so you do not have to run or tune the search system yourself.
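The store-then-query workflow can be illustrated with a tiny in-memory stand-in. This is a sketch under stated assumptions: `ToyVectorIndex` and its methods are hypothetical and are not Pinecone's API, the vectors and metadata are made up, and a real vector database uses approximate search structures rather than scoring every stored item.

```python
import math


class ToyVectorIndex:
    """Hypothetical in-memory stand-in for a vector index (not Pinecone's API)."""

    def __init__(self):
        self._items = {}  # item id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata):
        """Insert a vector with its metadata, or overwrite an existing id."""
        self._items[item_id] = (vector, metadata)

    def query(self, vector, top_k=2):
        """Return the top_k (score, id, metadata) rows, best match first."""
        scored = [
            (self._cosine(vector, stored), item_id, metadata)
            for item_id, (stored, metadata) in self._items.items()
        ]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))


index = ToyVectorIndex()
index.upsert("chunk-1", [0.9, 0.1, 0.0], {"title": "Patent manual", "page": 3})
index.upsert("chunk-2", [0.1, 0.9, 0.1], {"title": "Patent manual", "page": 12})
index.upsert("chunk-3", [0.0, 0.2, 0.9], {"title": "Cookbook", "page": 1})

results = index.query([0.8, 0.2, 0.0], top_k=2)
for score, item_id, metadata in results:
    print(round(score, 3), item_id, metadata)
```

Keeping metadata next to each vector is what lets a search result point back to the document and page it came from.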
Some examples of applications that you can build with Pinecone are:
- Semantic search: Find documents or products that match the user’s intent or query
- Recommendations: Suggest items or content that are similar or complementary to the user’s preferences or behavior
- Anomaly detection: Identify outliers or suspicious patterns in data
- Generation: Create new content or code that is similar or related to the input
You can learn more about Pinecone from their website or their blog. You can also find pricing details and sign up for a free account here.
Presenting the Python code and explaining its functionality
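This code is divided into two parts: the first loads the PDF, splits it into chunks, and indexes their embeddings in Pinecone; the second queries the index and passes the retrieved chunks to an OpenAI model to answer the question.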
Below is the Python script I developed, which can also be run in Google Colab at this link.
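# Install the dependencies (in Google Colab, prefix each command with "!")
# Note: this script calls pinecone.init(), which was removed in pinecone-client 3.0,
# so an older client is pinned here. The langchain imports below follow the 2023
# package layout; newer releases may need different import paths.
pip install langchain
pip install openai
pip install "pinecone-client<3"
pip install tiktoken
pip install pypdf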
# Provide your OpenAI API key and define the embedding model
OPENAI_API_KEY = "INSERT HERE YOUR OPENAI API KEY"
embed_model = "text-embedding-ada-002"
# Provide your Pinecone API key and specify the environment
PINECONE_API_KEY = "INSERT HERE YOUR PINECONE API KEY"
PINECONE_ENV = "INSERT HERE YOUR PINECONE ENVIRONMENT"
# Import the required modules
import openai, langchain, pinecone
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import PyPDFLoader
# Define a text splitter so that each chunk stays well within the 4096 token limit of OpenAI
text_splitter = RecursiveCharacterTextSplitter(
    # We set a small chunk size for demonstration; it is measured in characters
    # because length_function is len
    chunk_size = 2000,
    chunk_overlap = 0,
    length_function = len,
)
# Initialize Pinecone with your API key and environment
pinecone.init(
    api_key = PINECONE_API_KEY,
    environment = PINECONE_ENV
)
# Define the index name for Pinecone
index_name = 'pine-search'
# Create an OpenAI embedding object with your API key
embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY)
# Set up an OpenAI LLM model
llm = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY)
# Define a PDF loader and load the file (one document per page)
loader = PyPDFLoader("https://lawrence.eti.br/wp-content/uploads/2023/07/ManualdePatentes20210706.pdf")
file_content = loader.load()
# Use the text splitter to split the loaded file content into manageable chunks
book_texts = text_splitter.split_documents(file_content)
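# Check if the index exists in Pinecone and create it if it does not
# (text-embedding-ada-002 produces 1536-dimensional vectors)
if index_name not in pinecone.list_indexes():
    print("Index does not exist, creating it: ", index_name)
    pinecone.create_index(index_name, dimension=1536, metric="cosine")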
# Create a Pinecone vector search object from the text chunks
book_docsearch = Pinecone.from_texts([t.page_content for t in book_texts], embeddings, index_name = index_name)
# Define your query (Portuguese for "How do I file a patent in Brazil?")
query = "Como eu faço para depositar uma patente no Brasil?"
# Use the Pinecone vector search to find documents similar to the query
docs = book_docsearch.similarity_search(query)
# Set up a QA chain with the LLM model and the selected chain type
chain = load_qa_chain(llm, chain_type="stuff")
# Run the QA chain with the found documents and your query to get the answer
answer = chain.run(input_documents=docs, question=query)
print(answer)
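The "stuff" chain type places all the retrieved chunks into a single prompt and sends it to the model in one call. The sketch below shows that idea in plain Python. It is a simplified illustration only: the function `stuff_documents`, the prompt wording, and the placeholder chunks are assumptions, not LangChain's actual prompt or implementation.

```python
def stuff_documents(chunks, question):
    """Join all retrieved chunks into one context block and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks standing in for the results of a similarity search
retrieved_chunks = [
    "Chunk A: placeholder text about filing requirements.",
    "Chunk B: placeholder text about fees.",
]

prompt = stuff_documents(retrieved_chunks, "How do I file a patent?")
print(prompt)
```

Because everything goes into one prompt, the combined size of the retrieved chunks plus the question has to fit within the model's token limit, which is why the chunk size chosen in the text splitter matters.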
Below is the application I developed for real-time evaluation of the PDF Search Engine
You can try the web application I built to test the PDF search engine in real time. It lets you ask questions about the content of BRPTO’s Basic Manual for Patent Protection. Click here to launch the application.

Conclusion
In this blog post, I have shown you how to build a simple but powerful PDF search engine using LangChain, Pinecone, and OpenAI. This system can help you find the most relevant information in your PDF files quickly and easily. You can also extend it to handle other types of content, such as images, audio, or video, by using different data loaders and models.
I hope you enjoyed this tutorial and learned something new. If you have any questions or feedback, please feel free to leave a comment below or contact me here. Thank you for reading!
That’s it for today!
Sources:
GoodAITechnology/LangChain-Tutorials (github.com)