Cost-Effective Text Embedding: Leveraging Ollama Local Models with Azure SQL Databases

Embedding text using a local model can provide significant cost advantages and flexibility over cloud-based services. In this blog post, we explore how to set up and use a local model for text embedding and how this approach can be integrated with Azure SQL databases for advanced querying capabilities.

Cost Comparison: Open AI text-embedding-ada-002 pay Model vs. Local Model Setup Cost

When choosing between a paid service and setting up a local model for text embedding, it’s crucial to consider the cost implications based on the scale of your data and the frequency of usage. Below is a detailed comparison of the costs of using a paid model versus establishing a local one.

Pay Model Cost Estimate:

Open AI text-embedding-ada-002:

Using a paid model like OpenAI’s Ada V2 for embedding 1 terabyte of OCR texts would cost around $25,000. This estimation is based on converting every 4 characters into one token, which might vary depending on the content and structure of the OCR texts.

Local Model Cost Estimate:

Setup Costs:

The initial investment for setting up a local model can range from $4,050 to $12,750, depending on the selection of components, from mid-range to high-end. This one-time cost can be amortized over many uses and datasets, potentially offering a more cost-effective solution in the long run, especially for large data volumes.

Overall Financial Implications

While the upfront cost for a local model might seem high, it becomes significantly more economical with increased data volumes and repeated use. In contrast, the cost of using a pay model like OpenAI’s text-embedding-ada-002 scales linearly with data volume, leading to potentially high ongoing expenses.

Considering these factors, the local model offers a cost advantage and greater control over data processing and security, making it an attractive option for organizations handling large quantities of sensitive data.

Why I Have Decided to Use a Local Model?

Cost and data volume considerations primarily drove the decision to use a local model for text embedding. With over 20 terabytes of data, including 1 terabyte of OCR text to embed, the estimated cost of using a commercial text-embedding model like OpenAI’s text-embedding-ada-002 would be around USD 25,000. By setting up a local model, we can process our data at a fraction of this cost, reducing expenses by 49% to 84%.

Exploring Local Models: Testing BGE-M3, MXBAI-EMBED-LARGE, NOMIC-EMBED-TEXT, and text-embedding-ada-002 from Open AI.

I encountered some intriguing results in my recent tests with local embedding models BGE-M3 and NOMIC-EMBED-TEXT. Both models showed an accuracy below 0.80 when benchmarked against OpenAI’s “Text-embedding-ada-002.” This comparison has sparked a valuable discussion about the capabilities and limitations of different embedding technologies.

How to Choose the Best Model for Your Needs?

When considering open-source embedding models like NOMIC-EMBED-TEXT, BGE-M3, and MXBAI-EMBED-LARGE, specific strengths and applications that make them suitable for various machine learning tasks should be considered.

1. NOMIC-EMBED-TEXT: This model is specifically designed for handling long-context text, making it suitable for tasks that involve processing extensive documents or content that benefits from understanding broader contexts. It achieves this by training on full Wikipedia articles and various large-scale question-answering datasets, which helps it capture long-range dependencies.

2. BGE-M3: Part of the BGE (Beijing Academy of Artificial Intelligence) series, this model is adapted for sentence similarity tasks. It’s built to handle multilingual content effectively, which makes it a versatile choice for applications requiring understanding or comparing sentences across different languages.

3. MXBAI-EMBED-LARGE: This model is noted for its feature extraction capabilities, making it particularly useful for tasks that require distilling complex data into simpler, meaningful representations. Its training involves diverse datasets, enhancing its generalization across text types and contexts.

Each model brings unique capabilities, such as handling longer texts or providing robust multilingual support. When choosing among these models, consider the specific needs of your project, such as the length of text you need to process, the importance of multilingual capabilities, and the type of machine learning tasks you aim to perform (e.g., text similarity, feature extraction). Testing them with specific data is crucial to determine which model performs best in your context.

In our analysis, we’ve compared various results and identified the best open-source model to use compared to the OpenAI’s Text-embedding-ada-002.

We executed this query using the keyword ‘Microsoft’ to search the vector table and compare the content of Wikipedia articles.

SQL
declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'
select w.title, w.text from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 1, 0.25) 
order by dot_product desc) w
order by w.title
go

We utilized the KMeans compute node for text similarity analysis, focusing on a single cluster search. For a detailed, step-by-step guide on creating this dataset, please refer to the article I shared at the end of this article.

Follow the results overview:

To calculate the percentage of similarity of each model with “Text-embedding-ada-002”, we’ll determine how many keywords match between “Text-embedding-ada-002” and the other models, then express this as a percentage of the total keywords in “Text-embedding-ada-002”. Here’s the updated table with the percentages:

Follow the comparison table:

  1. Text-embedding-ada-002 Keywords Total: 10 (100% is based on these keywords).
  2. Matching Keywords:
       – BGE-M3: Matches 7 out of 10 keywords of Text-embedding-ada-002.
       – NOMIC-EMBED-TEXT: Matches 3 out of 10 keywords of Text-embedding-ada-002.
       – MXBAI-EMBED-LARGE: Matches 1 out of 10 keywords of Text-embedding-ada-002.

This table illustrates that the BGE-M3 model is similar to “Text-embedding-ada-002,” with 70% of the keywords matching. It is followed by “NOMIC-EMBED-TEXT” at 30% and “MXBAI-EMBED-LARGE,” with the least similarity at 10%.

How does it perform when doing an approximate search with 1, 4, 8, and 16 clusters?

We execute this query within the Azure database to perform this test across each database and model we use:

SQL
create table #trab ( linha varchar( 200) null )

insert into #trab (linha) values ('Model: mxbai-embed-large')

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 1 cluster')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 1, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 4 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 4, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 8 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 8, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 16 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 16, 0.25) 
order by dot_product desc) w
order by w.title
go

select * from #trab

drop table #trab

Follow the results overview:

Based on the previous detailed list, here are the calculations for the percentage of similarity:

1. Total Distinct Keywords in Text-embedding-ada-002: 10 (100% based on these keywords)

2. Keywords in each Cluster Search:

   – BGE-M3: 5 keywords (Microsoft, Microsoft Office, Microsoft Windows, Microsoft Word, MSN)

   – NOMIC-EMBED-TEXT: 4 keywords (Microsoft, MSN, Nokia, Outlook.com)

   – MXBAI-EMBED-LARGE: 2 keywords (Microsoft, Nokia)

Here’s the updated table with the percentage similarity for searches with 1, 4, 8, and 16 clusters:

This table shows the similarity percentages for each model across different cluster configurations compared to the “text-embedding-ada-002” model. Each model retains a consistent similarity percentage across all cluster numbers, indicating that the cluster configuration did not affect the keywords searched for in these cases.

To execute the Python code to embed the vectors, first, you have to install Ollama

How Did You Set Up a Local Model Using Ollama?

To run an Ollama model with your GPU, you can use the official Docker image provided by Ollama. The Docker image supports Nvidia GPUs and can be installed using the NVIDIA Container Toolkit. Here are the steps to get started:

  1. Install Docker: Download and install Docker Desktop or Docker Engine, depending on your operating system.
  2. Select and Pull the Ollama Model: Choose a preferred model from the Ollama library, such as nomic-embed-text or mxbai-embed-large, and pull it using the following command: docker pull ollama/ollama..
  3. Run the Ollama Docker Image: Execute Docker run commands to set up the Ollama container. You can configure it specifically for either CPU or Nvidia GPU environments. Run the Docker container with the following command: docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama.
  4. You can now run the Ollama model using the following command: docker exec -it ollama ollama run nomic-embed-text  or docker exec -it ollama ollama run mxbai-embed-large .
  5. Access and Use the Model: To start interacting with your model, utilize the Ollama WebUI by navigating to the local address provided (typically http://localhost:11434).

Please note that the above commands assume you have already installed Docker on your system. If you haven’t installed Docker yet, you can download it from the official Docker website.

You can also download and install Ollama on Windows:

How do you convert text into embedding using the local model using Ollama?

After setting up your local model with Ollama, you can use the following Python script to convert text into embeddings:

Python
# Importing necessary libraries and modules
import os
import pyodbc  # SQL connection library for Microsoft databases
import requests  # For making HTTP requests
from dotenv import load_dotenv  # To load environment variables from a .env file
import numpy as np  # Library for numerical operations
from sklearn.preprocessing import normalize  # For normalizing data
import json  # For handling JSON data
from db.utils import NpEncoder  # Custom JSON encoder for numpy data types

# Load environment variables from a .env file located in the same directory
load_dotenv()

# This is the connection string for connecting to the Azure SQL database we are getting from the environment variables
#MSSQL='Driver={ODBC Driver 17 for SQL Server};Server=localhost;Database=<DATABASE NAME>;Uid=<USER>;Pwd=<PASSWOPRD>;Encrypt=No;Connection Timeout=30;'

# Retrieve the database connection string from environment variables
dbconnectstring = os.getenv('MSSQL')

# Establish a connection to the Azure SQL database using the connection string
conn = pyodbc.connect(dbconnectstring)

def get_embedding(text, model):
    # Prepare the input text by truncating it or preprocessing if needed
    truncated_text = text

    # Make an HTTP POST request to a local server API to get embeddings for the input text
    res = requests.post(url='http://localhost:11434/api/embeddings',
                        json={
                            'model': model, 
                            'prompt': truncated_text
                        }
    )
    
    # Extract the embedding from the JSON response
    embeddings = res.json()['embedding']
    
    # Convert the embedding list to a numpy array
    embeddings = np.array(embeddings)    
    
    # Normalize the embeddings array to unit length
    nc = normalize([embeddings])
        
    # Convert the numpy array back to JSON string using a custom encoder that handles numpy types
    return json.dumps(nc[0], cls=NpEncoder )

def update_database(id, title_vector, content_vector):
    # Obtain a new cursor from the database connection
    cursor = conn.cursor()

    # Convert numpy array embeddings to string representations for storing in SQL
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # SQL query to update the embeddings in the database
    cursor.execute("""
        UPDATE wikipedia_articles_embeddings
        SET title_vector = ?, content_vector = ?
        WHERE id = ?
    """, (title_vector_str, content_vector_str, id))
    conn.commit()  # Commit the transaction to the database

def embed_and_update(model):
    # Get a cursor from the database connection
    cursor = conn.cursor()
    
    # Retrieve articles from the database that need their embeddings updated
    cursor.execute("select id, title, text from wikipedia_articles_embeddings where title_vector = '' or content_vector = '' order by id desc")
    
    for row in cursor.fetchall():
        id, title, text = row
        
        # Get embeddings for title and text
        title_vector = get_embedding(title, model)
        content_vector = get_embedding(text, model)
        
        # Print the progress with length of the generated embeddings
        print(f"Embedding article {id} - {title}", "len:", len(title_vector), len(content_vector))
        
        # Update the database with new embeddings
        update_database(id, title_vector, content_vector)

# Call the function to update embeddings using the 'nomic-embed-text' model
embed_and_update('nomic-embed-text')

# To use another model, uncomment and call the function with the different model name
# embed_and_update('mxbai-embed-large')

I’ve also created a GitHub repository with these codes; you can access it at this link.

Download the pre-calculated embeddings using OpenAI’s text-embedding-ada-002

The pre-calculated embeddings with OpenAI’s text-embedding-ada-002, both for the title and the body, of a selection of Wikipedia articles, is made available by Open AI here:

https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip

Once you have successfully embedded your text, I recommend exploring two of my blog posts that detail how to create a vector database for prompting and searching. These posts provide step-by-step guidance on utilizing Azure SQL alongside cosine similarity and KMeans algorithms for efficient and effective data retrieval.

Azure SQL Database now has native vector support

You can sign up for the private preview at this link.

This article, published by Davide Mauri and Pooja Kamath during this week’s Microsoft Build event, provides all the information.

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

Conclusion

Embedding text locally using models like Ollama presents a cost-effective, scalable solution for handling large volumes of data. By integrating these embeddings into Azure SQL databases, organizations can leverage generative AI to enhance their querying capabilities, making extracting meaningful insights from vast datasets easier. The outlined process ensures significant cost savings and enhances data security and processing efficiency.

This approach is a technical exercise and a strategic asset that can drive better decision-making and innovation across various data-intensive applications.

That’s it for today!

Sources

GitHub – Azure-Samples/azure-sql-db-vectors-kmeans: Use KMeans clustering to speed up vector search in Azure SQL DB

Vector Similarity Search with Azure SQL database and OpenAI | by Davide Mauri | Microsoft Azure | Medium

Ollama

How to Install and Run Ollama with Docker: A Beginner’s Guide – Collabnix

Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL – Tech News & Insights (lawrence.eti.br)

Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide How to Use Generative AI to Prompt Queries in Datasets – Tech News & Insights (lawrence.eti.br)

GitHub – LawrenceTeixeira/embedyourlocalmodel

Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL

In the ever-evolving landscape of data management and retrieval, the ability to efficiently search through high-dimensional vector data has become a cornerstone for many modern applications, including recommendation systems, image recognition, and natural language processing tasks. Azure SQL Database (DB), in combination with KMeans clustering, is at the forefront of this revolution, offering an innovative solution that significantly enhances vector search capabilities.

What is KMeans?

KMeans is a widely used clustering algorithm in machine learning and data mining. It’s a method for partitioning an N-dimensional dataset into K distinct, non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the points in the cluster. The algorithm aims to minimize the variance within each cluster, effectively grouping the data points into clusters based on their similarity.

Let’s implement an example of the code to understand how it works Import the NumPy module and then go through the rest of the code to know how the K-Means clustering is implemented.

Python
#Loading the required modules
 
import numpy as np
from scipy.spatial.distance import cdist 
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
 
#Defining our function 
def kmeans(x,k, no_of_iterations):
    idx = np.random.choice(len(x), k, replace=False)
    #Randomly choosing Centroids 
    centroids = x[idx, :] #Step 1
     
    #finding the distance between centroids and all the data points
    distances = cdist(x, centroids ,'euclidean') #Step 2
     
    #Centroid with the minimum Distance
    points = np.array([np.argmin(i) for i in distances]) #Step 3
     
    #Repeating the above steps for a defined number of iterations
    #Step 4
    for _ in range(no_of_iterations): 
        centroids = []
        for idx in range(k):
            #Updating Centroids by taking mean of Cluster it belongs to
            temp_cent = x[points==idx].mean(axis=0) 
            centroids.append(temp_cent)
 
        centroids = np.vstack(centroids) #Updated Centroids 
         
        distances = cdist(x, centroids ,'euclidean')
        points = np.array([np.argmin(i) for i in distances])
         
    return points 
 
 
#Load Data
data = load_digits().data
pca = PCA(2)
  
#Transform the data
df = pca.fit_transform(data)
 
#Applying our function
label = kmeans(df,10,1000)
 
#Visualize the results
 
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i , 0] , df[label == i , 1] , label = i)
plt.legend()
plt.show()

How does Voronoi Cell-based Vector Search Optimization work?

Vector Search Optimization via Voronoi Cells is an advanced technique to enhance the efficiency and accuracy of searching for similar vectors in a high-dimensional space. This method is particularly relevant in the context of Approximate Nearest Neighbor (ANN) searches, which aim to quickly find vectors close to a given query vector without exhaustively comparing the query vector against every other vector in the dataset.

Understanding Voronoi Cells

To grasp the concept of vector search optimization via Voronoi Cells, it’s essential to understand what Voronoi diagrams are. A Voronoi diagram is partitioning a plane into regions based on the distance to points in a specific subset of the plane. Each region (Voronoi cell) is defined so that any point within the region is closer to its corresponding “seed” point than any other. These seed points are typically referred to as centroids in the context of vector search.

Application in Vector Search

Voronoi Cells can efficiently partition the high-dimensional space into distinct regions in vector search. Each area represents a cluster of vectors closer to its centroid than any other centroid. This approach is based on the assumption that vectors within the same Voronoi cell are more likely to be similar to each other than to vectors in different cells.

The Process

  1. Centroid Initialization: Like KMeans clustering, the process begins by selecting a set of initial centroids in the high-dimensional space.
  2. Voronoi Partitioning: The space is partitioned into Voronoi cells, each associated with one centroid. This partitioning is done such that every vector in the dataset is assigned to the cell of the closest centroid.
  3. Indexing and Search Optimization: Once the high-dimensional space is partitioned, an inverted file index or a similar data structure can be created to map each centroid to the list of vectors (or pointers to them) within its corresponding Voronoi cell. During a search query, instead of comparing the query vector against all vectors in the dataset, the search can be limited to vectors within the most relevant Voronoi cells, significantly reducing the search space and time.

Advantages

  • Efficiency: By reducing the search space to a few relevant Voronoi cells, the algorithm can achieve faster search times than brute-force searches.
  • Scalability: This method scales better with large datasets, as the overhead of partitioning the space and indexing is compensated by the speedup in query times.
  • Flexibility: The approach can be adapted to various data types and dimensionalities by adjusting the centroid selection and cell partitioning methods.

Introducing the Project of Azure SQL DB and KMeans Compute Node

Azure SQL DB has long been recognized for its robustness, scalability, and security as a cloud database service. By integrating KMeans clustering—a method used to partition data into k distinct clusters based on similarity—the capability of Azure SQL DB is expanded to include advanced vector search operations.

The KMeans Compute Node is a specialized component that handles the compute-intensive task of clustering high-dimensional data. This integration optimizes the performance of vector searches and simplifies the management and deployment of these solutions.

How It Works

  1. Data Storage: Vector data is stored in Azure SQL DB, leveraging its high availability and scalable storage solutions. This setup ensures that data management adheres to best practices regarding security and compliance.
  2. Vector Clustering: The KMeans Compute Node performs clustering on the vector data. This process groups vectors into clusters based on similarity, significantly reducing the search space for query operations.
  3. Search Optimization: Approximate Nearest Neighbor (ANN) searches can be executed more efficiently with vectors organized into clusters. Queries are directed towards relevant clusters rather than the entire dataset, enhancing search speed and accuracy.
  4. Seamless Integration: The entire process is streamlined through Azure Container Apps, which host the KMeans Compute Node. This setup provides a scalable, serverless environment that dynamically adjusts demand-based resources.

Advantages of the Azure SQL DB and KMeans Approach

  • Performance: By reducing the complexity of vector searches, response times are significantly improved, allowing for real-time search capabilities even in large datasets.
  • Scalability: The solution effortlessly scales with your data, thanks to Azure’s cloud infrastructure. This ensures that growing data volumes do not compromise search efficiency.
  • Cost-Effectiveness: Azure SQL DB offers a cost-efficient storage solution, while using Azure Container Apps for the KMeans Compute Node optimizes resource utilization, reducing overall expenses.
  • Simplicity: The integration simplifies the architecture of vector search systems, making it easier to deploy, manage, and maintain these solutions.

Use Cases and Applications

The Azure SQL DB and KMeans Compute Node solution is versatile, supporting a wide range of applications:

1-Analysis of Similarity in Court Decisions
  • Legal Research and Precedents: By analyzing the similarities in court decisions, legal professionals can efficiently find relevant precedents to support their cases. This application can significantly speed up legal research, ensuring lawyers can access comprehensive and pertinent case law that aligns closely with their current matters.
2-Personalized Medicine and Genomic Data Analysis
  • Drug Response Prediction: Leveraging vector search to analyze genomic data allows researchers to predict how patients might respond to different treatments. By clustering patients based on genetic markers, medical professionals can tailor treatments to individual genetic profiles, advancing the field of personalized medicine.
3-Market Trend Analysis
  • Consumer Behavior Clustering: Businesses can cluster consumer behavior data to identify emerging market trends and tailor their marketing strategies accordingly. Vector search can help analyze high-dimensional data, such as purchase history and online behavior, to segment consumers into groups with similar preferences and behaviors.
4-Cybersecurity Threat Detection
  • Anomaly Detection in Network Traffic: Vector search can monitor network traffic, identifying unusual patterns that may indicate cybersecurity threats. By clustering network events, it’s possible to quickly isolate and investigate anomalies, enhancing an organization’s ability to respond to potential security breaches.
5-Educational Content and Learning Style Personalization
  • Matching Educational Materials to Learning Styles: By clustering educational content and student profiles, educational platforms can personalize learning experiences. Vector search can identify the most suitable materials and teaching methods for different learning styles, improving student engagement and outcomes.
6-Environmental Monitoring and Conservation Efforts
  • Species Distribution Modeling: Vector search can analyze environmental data to model the distribution of various species across different habitats. This information is crucial for conservation planning, helping identify critical areas for biodiversity conservation.
7-Supply Chain Optimization
  • Predictive Maintenance and Inventory Management: In supply chain management, vector search can cluster equipment performance data to predict maintenance needs and optimize inventory levels. This application ensures that operations run smoothly, with minimal downtime and efficient use of resources.
8-Creative Industries and Content Creation
  • Similarity Analysis in Music and Art: Artists and creators can use vector search to analyze patterns and themes in music, art, and literature. This approach can help understand influences, trends, and the evolution of styles over time, providing valuable insights for new creations.

    Architecture of the project

    The project’s architecture is straightforward as it comprises a single container that exposes a REST API to build and rebuild the index and search for similar vectors. The container is deployed to Azure Container Apps and uses Azure SQL DB to store the vectors and the clusters.

    The idea is that compute-intensive operations, like calculating KMeans, can be offloaded to a dedicated container that is easy to deploy, quick to start, and offers serverless scaling for the best performance/cost ratio.

    Once the container runs, it is entirely independent of the database and can work without affecting database performance. Even better, if more scalability is needed, data can be partitioned across multiple container instances to achieve parallelism.

    Once the model has been trained, the identified clusters and centroids – and thus the IVF index – are saved back to the SQL DB so that they can be used to perform ANN search on the vector column without the need for the container to remain active. The container can be stopped entirely as SQL DB is completely autonomous now.

    The data is stored back in SQL DB using the following tables:

    • [$vector].[kmeans]: stores information about created indexes
    • [$vector].[<table_name>$<column_name>$clusters_centroids]: stores the centroids
    • [$vector].[<table_name>$<column_name>$clusters]: the IVF structure, associating each centroid to the list of vectors assigned to it

    to search even more accessible, a function is created also:

    • [$vector].[find_similar$<table_name>$<column_name>](<vector>, <probe cells count>, <similarity threshold>): the function to perform ANN search

    The function calculates the dot product, the same as the cosine similarity if vectors are normalized to 1.

    Also, the function:

    • [$vector].[find_cluster$<table_name>$<column_name>](<vector>): find the cluster of a given vector

    is provided as it is needed to insert new vectors into the IVF index.

    Implementation

    The project is divided into two GitHub repositories: one with Python source code for the KMeans compute node created by Davide Mauri, Principal Product Manager in Azure SQL DB at Microsoft, and the other part with the actual example app I have created to test the project.

    1. Azure SQL DB Vector – KMeans Compute Node

    KMeans model from Scikit Learn is executed within a container as a REST endpoint. The API exposed by the container are:

    • Server Status: GET /
    • Build Index: POST /kmeans/build
    • Rebuild Index: POST /kmeans/rebuild

    Both Build and Rebuild API are asynchronous. The Server Status API can be used to check the status of the build process.

    Build Index

    To build an index from scratch, the Build API expects the following payload:

    JSON
    {
      "table": {
        "schema": <schema name>,
        "name": <table name>
      },
      "column": {
        "id": <id column name>,
        "vector": <vector column name>
      },
      "vector": {
        "dimensions": <dimensions>
      }
    }

    Using the aforementioned Wikipedia dataset, the payload would be:

    JSON
    POST /kmeans/build
    {
        "table": {
            "schema": "dbo",
            "name": "news"
        },
        "column": {
            "id": "article_id",
            "vector": "content_vector"
        },
        "vector": {
            "dimensions": 1536
        }
    }

    The API would verify that the request is correct and then start the build process asynchronously, returning the ID assigned to the index being created:

    JSON
    {
      "id": 1,
      "status": {
        "status": {
          "current": "initializing",
          "last": "idle"
        },
        "index_id": "1"
      }
    }

    The API will return an error if the index on the same table and vector column already exists. If you want to force the creation of a new index over the existing one, you can use the force Option:

    POST /kmeans/build?force=true

    Rebuild Index

    If you need to rebuild an existing index, you can use the Rebuild API. The API doesn’t need a payload, as it will use the existing index definition. Just like the build process, the rebuild process is also asynchronous. The index to be rebuilt is specified via the URL path:

    POST /kmeans/rebuild/<index id>

    For example, to rebuild the index with id 1:

    POST /kmeans/rebuild/1

    Query API Status

    The status of the build process can be checked using the Server Status API:

    GET /

    And you’ll get the current status and the last status report:

    JSON
    {
      "server": {
        "status": {
          "current": "building",
          "last": "initializing"
        },
        "index_id": 1
      },
      "version": "0.0.1"
    }

    Checking the previous status is helpful to understand whether an error occurred during the build process.

    You can also check the index build status by querying the [$vector].[kmeans] table.

    2. Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL

    Search for similar vectors.


    Once you have built the index, you can search for similar vectors. Using the sample dataset, you can search for the 10 most similar news to ‘How Generative AI Is Transforming Today’s And Tomorrow’s Software Development Life Cycle.’ using the find_similar function created as part of the index build process. For example:

    SQL
    /*
        This SQL code is used to search for similar news articles based on a given input using vector embeddings.
        It makes use of an external REST endpoint to retrieve the embeddings for the input text.
        The code then calls the 'find_similar$news$content_vector' function to find the top 10 similar news articles.
        The similarity is calculated based on the dot product of the embeddings.
        The result is ordered by the dot product in descending order.
    */
    
    declare @response nvarchar(max);
    declare @payload nvarchar(max) = json_object('input': 'How Generative AI Is Transforming Today’s And Tomorrow’s Software Development Life Cycle.');
    
    exec sp_invoke_external_rest_endpoint
        @url = 'https://<YOUR APP>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview',
        @credential = [https://<YOUR APP>.openai.azure.com],
        @payload = @payload,
        @response = @response output;
    
    select top 10 r.published, r.category, r.author, r.title, r.content, r.dot_product
    from [$vector].find_similar$news$content_vector(json_query(@response, '$.result.data[0].embedding'), 50, 0.80)  AS r
    order by dot_product desc

    The find_similar the function takes three parameters:

    • the vector to search for
    • the number of clusters to search in
    • the similarity threshold

    The similarity threshold filters out vectors that are not similar enough to the query vector. The higher the threshold, the more similar the vectors returned will be. The number of clusters to search in is used to speed up the search. The higher the number of clusters, the more similar the vectors returned will be. The lower the number of clusters, the faster the search will be.

    Explore the latest app I’ve created, which is tailored to help you craft prompts and assess your performance utilizing my updated news dataset. Click here to start discovering the app’s features. In my app, I use all 50 clusters to search in, and 80% with the similarity threshold.

    It’s important to understand that you can search multiple articles simultaneously and get similar results. Look at the example below:

    This post connects to another one where I discuss ‘Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide on Using Generative AI to Prompt Queries in Datasets.’ However, using the cosine similarity.

    Conclusion

    Integrating Azure SQL DB with KMeans Compute Node represents a significant advancement in vector search, providing an efficient, scalable, and cost-effective solution. This innovative approach to managing and querying high-dimensional data stands as a beacon for businesses wrestling with the intricacies of big data. By leveraging such cutting-edge technologies, organizations are better positioned to unlock their data’s full potential, uncovering insights previously obscured by the sheer volume and complexity of the information. This, in turn, allows for delivering superior services and products more closely aligned with user needs and preferences.

    Moreover, adopting Azure’s robust infrastructure and the strategic application of the KMeans clustering algorithm underscores a broader shift towards more intelligent, data-driven decision-making processes. As companies strive to remain competitive in an increasingly data-centric world, the ability to swiftly and accurately sift through vast datasets to find relevant information becomes paramount. Azure SQL DB and KMeans Compute Node facilitate this, enabling businesses to improve operational efficiencies, innovate, and personalize their offerings, enhancing customer satisfaction and engagement.

    Looking ahead, the convergence of Azure SQL DB and KMeans Compute Node is setting the stage for a new era in data management and retrieval. As this technology continues to evolve and mature, it promises to open up even more possibilities for deep analytical insights and real-time data interaction. This revolution in vector search is not just about managing data more effectively; it’s about reimagining what’s possible with big data, paving the way for future innovations that will continue to transform industries and redefine user experiences. With Azure at the forefront, the future of data management is bright, marked by an era of unparalleled efficiency, scalability, and insight.

    That’s it for today!

    Sources

    Azure-Samples/azure-sql-db-vectors-kmeans: Use KMeans clustering to speed up vector search in Azure SQL DB (github.com)

    K-Means Clustering From Scratch in Python [Algorithm Explained] – AskPython