Integrating Azure OpenAI with Native Vector Support in Azure SQL Databases for Advanced Search Capabilities and Data Insights

Azure SQL Database has taken a significant step forward by introducing native support for vectors, unlocking advanced capabilities for applications that rely on semantic search, AI, and machine learning. By integrating vector search into Azure SQL, developers can now store, search, and analyze vector data directly alongside traditional SQL data, offering a unified solution for complex data analysis and enhanced search experiences.

Vectors in Azure SQL Database

Vectors are numerical representations of objects like text, images, or audio. They are essential for applications involving semantic search, recommendation systems, and more. These vectors are typically generated by machine learning models, capturing the semantic meaning of the data they represent.

The new vector functionality in Azure SQL Database allows you to store and manage these vectors within a familiar SQL environment. This eliminates the need for separate vector databases, streamlining your application architecture and simplifying your data management processes.

Key Benefits of Native Vector Support in Azure SQL

  • Unified Data Management: Store and query both traditional and vector data in a single database, reducing complexity and maintenance overhead.
  • Advanced Search Capabilities: Perform similarity searches alongside standard SQL queries, leveraging Azure SQL’s sophisticated query optimizer and powerful enterprise features.
  • Optimized Performance: Vectors are stored in a compact binary format, allowing for efficient distance calculations and optimized performance on vector-related operations.

Embeddings: The Foundation of Vector Search

At the heart of vector search are embeddings—dense vector representations of objects, generated by deep learning models. These embeddings capture the semantic similarities between related concepts, enabling tasks such as semantic search, natural language processing, and recommendation systems.

For example, word embeddings can cluster related words like “computer,” “software,” and “machine,” while distant clusters might represent words with entirely different meanings, such as “lion,” “cat,” and “dog.” These embeddings are particularly powerful in applications where context and meaning are more important than exact keyword matches.

Azure OpenAI makes it easy to generate embeddings by providing pre-trained machine learning models accessible through REST endpoints. Once generated, these embeddings can be stored directly in an Azure SQL Database, allowing you to perform vector search queries to find similar data points.

You can explore how vector embeddings work by visiting this amazing website: Transformer Explainer. It offers an excellent interactive experience to help you better understand how Generative AI operates in general.

Vector Search Use Cases

Vector search is a powerful technique used to find vectors in a dataset that are similar to a given query vector. This capability is essential in various applications, including:

  • Semantic Search: Rank search results based on their relevance to the user’s query.
  • Recommendation Systems: Suggest related items based on similarity in vector space.
  • Clustering: Group similar items together based on vector similarity.
  • Anomaly Detection: Identify outliers in data by finding vectors that differ significantly from the norm.
  • Classification: Classify items based on the similarity of their vectors to predefined categories.

For instance, consider a semantic search application where a user queries for “healthy breakfast options.” A vector search would compare the vector representation of the query with vectors representing product reviews, finding the most contextually relevant items—even if the exact keywords don’t match.

Key Features of Native Vector Support in Azure SQL

Azure SQL’s native vector support introduces several new functions to operate on vectors, which are stored in a binary format to optimize performance. Here are the key functions:

  • JSON_ARRAY_TO_VECTOR: Converts a JSON array into a vector, enabling you to store embeddings in a compact format.
  • ISVECTOR: Checks whether a binary value is a valid vector, ensuring data integrity.
  • VECTOR_TO_JSON_ARRAY: Converts a binary vector back into a human-readable JSON array, making it easier to work with the data.
  • VECTOR_DISTANCE: Calculates the distance between two vectors using a chosen distance metric, such as cosine or Euclidean distance.

These functions enable powerful operations for creating, storing, and querying vector data in Azure SQL Database.
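As a mental model, VECTOR_DISTANCE('cosine', ...) computes the same quantity as this small hand-rolled Python sketch of cosine distance (an illustration only, not Azure SQL code):

```python
import math

def cosine_distance(a, b):
    # Cosine distance = 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Vectors pointing the same way have distance 0; orthogonal vectors have distance 1.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Smaller distances mean more semantically similar embeddings, which is why the search queries later in this article order results by the distance ascending.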

Example: Vector Search in Action

Let’s walk through an example of using Azure SQL Database to store and query vector embeddings. Imagine you have a table of customer reviews, and you want to find reviews that are contextually related to a user’s search query.

  1. Storing Embeddings as Vectors:
    After generating embeddings using Azure OpenAI, you can store these vectors in a VARBINARY(8000) column in your SQL table:
SQL
   ALTER TABLE [dbo].[FineFoodReviews] ADD [VectorBinary] VARBINARY(8000);
   UPDATE [dbo].[FineFoodReviews]
   SET [VectorBinary] = JSON_ARRAY_TO_VECTOR([vector]);

This allows you to store the embeddings efficiently, ready for vector search operations.

  2. Performing Similarity Searches:
    To find reviews that are similar to a user’s query, you can convert the query into a vector and calculate the cosine distance between the query vector and the stored embeddings:
SQL
   DECLARE @e VARBINARY(8000);
   EXEC dbo.GET_EMBEDDINGS @model = '<yourmodeldeploymentname>', @text = 'healthy breakfast options', @embedding = @e OUTPUT;

   SELECT TOP(10) ProductId,
                  Summary,
                  Text,
                  VECTOR_DISTANCE('cosine', @e, VectorBinary) AS Distance
   FROM dbo.FineFoodReviews
   ORDER BY Distance;

This query returns the top reviews that are contextually related to the user’s search, even if the exact words don’t match.

  3. Hybrid Search with Filters:
    You can enhance vector search by combining it with traditional keyword filters to improve relevance and performance. For example, you could filter reviews based on criteria like user identity, review score, or the presence of specific keywords, and then apply vector search to rank the results by relevance:
SQL
   -- Comprehensive query with multiple filters.
   SELECT TOP(10)
       f.Id,
       f.ProductId,
       f.UserId,
       f.Score,
       f.Summary,
       f.Text,
       VECTOR_DISTANCE('cosine', @e, VectorBinary) AS Distance,
       CASE 
           WHEN LEN(f.Text) > 100 THEN 'Detailed Review'
           ELSE 'Short Review'
       END AS ReviewLength,
       CASE 
           WHEN f.Score >= 4 THEN 'High Score'
           WHEN f.Score BETWEEN 2 AND 3 THEN 'Medium Score'
           ELSE 'Low Score'
       END AS ScoreCategory
   FROM FineFoodReviews f
   WHERE
       f.UserId NOT LIKE 'Anonymous%'  -- Exclude anonymous users
       AND f.Score >= 2               -- Score threshold filter
       AND LEN(f.Text) > 50           -- Text length filter for detailed reviews
       AND (f.Text LIKE '%gluten%' OR f.Text LIKE '%dairy%') -- Keyword filter
   ORDER BY
       Distance,  -- Order by cosine distance
       f.Score DESC, -- Secondary order by review score
       ReviewLength DESC; -- Tertiary order by review length

This query combines semantic search with traditional filters, balancing relevance and computational efficiency.
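Conceptually, the hybrid pattern is: apply cheap relational predicates first, then rank only the survivors by vector distance. A minimal Python sketch with made-up review data (all names, values, and embeddings here are illustrative):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# Made-up reviews: (text, score, embedding)
reviews = [
    ("Great gluten-free oats", 5, [0.9, 0.1]),
    ("Dairy-free shake, ok",   3, [0.7, 0.3]),
    ("Terrible candy",         1, [0.1, 0.9]),
]

query_vec = [1.0, 0.0]  # pretend embedding of "healthy breakfast options"

# Step 1: cheap score and keyword filters narrow the candidate set ...
candidates = [r for r in reviews
              if r[1] >= 2 and ("gluten" in r[0].lower() or "dairy" in r[0].lower())]
# Step 2: ... then the expensive distance computation ranks only the survivors.
ranked = sorted(candidates, key=lambda r: cosine_distance(query_vec, r[2]))
print([text for text, _, _ in ranked])
```

The same trade-off applies in SQL: the WHERE clause shrinks the set of rows on which VECTOR_DISTANCE must be evaluated.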

Leveraging REST Services for Embedding Generation

Azure OpenAI provides REST endpoints for generating embeddings, which can be consumed directly from Azure SQL Database using the sp_invoke_external_rest_endpoint system stored procedure. This integration enables seamless interaction between your data and AI models, allowing you to build intelligent applications that combine the power of machine learning with the familiarity of SQL.

Here’s a stored procedure example that retrieves embeddings from a deployed Azure OpenAI model and stores them in the database:

SQL
CREATE PROCEDURE [dbo].[GET_EMBEDDINGS]
(
    @model VARCHAR(MAX),
    @text NVARCHAR(MAX),
    @embedding VARBINARY(8000) OUTPUT
)
AS
BEGIN
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @url VARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @text);

    SET @url = 'https://<resourcename>.openai.azure.com/openai/deployments/' + @model + '/embeddings?api-version=2023-03-15-preview';

    EXEC dbo.sp_invoke_external_rest_endpoint 
        @url = @url,
        @method = 'POST',   
        @payload = @payload,   
        @headers = '{"Content-Type":"application/json", "api-key":"<openAIkey>"}', 
        @response = @response OUTPUT;

    DECLARE @jsonArray NVARCHAR(MAX) = JSON_QUERY(@response, '$.result.data[0].embedding');
    SET @embedding = JSON_ARRAY_TO_VECTOR(@jsonArray);
END
GO

This stored procedure retrieves embeddings from the Azure OpenAI model and converts them into a binary format for storage in the database, making them available for similarity search and other operations.

Let’s Implement an Experiment with Native Vector Support in Azure SQL

Azure SQL Database provides a seamless way to store and manage vector data even without a dedicated vector data type. Vectors, which are essentially lists of numbers, can be stored efficiently in a table, either with individual elements spread across columns or as serialized arrays in a single column. This approach ensures efficient storage and retrieval, making Azure SQL suitable for large-scale vector data management.

I used the Global News Dataset from Kaggle in my experiment.

First, you must create the columns to store the vector information. In my case, I created two columns: title_vector for the news title and content_vector for the news content. For this, I wrote a small Python script, but you can also do it directly from SQL using a cursor. It's worth noting that by saving the vector information inside Azure SQL, you don't need to pay for a separate vector database.

Python
from litellm import embedding
import pyodbc  # or another SQL connection library
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Set up OpenAI credentials from environment variables
os.environ['AZURE_API_KEY'] = os.getenv('AZURE_API_KEY')
os.environ['AZURE_API_BASE'] = os.getenv('AZURE_API_BASE')
os.environ['AZURE_API_VERSION'] = os.getenv('AZURE_API_VERSION')

# Connect to your Azure SQL database
conn = pyodbc.connect(f'DRIVER={{ODBC Driver 17 for SQL Server}};'
                      f'SERVER={os.getenv("DB_SERVER")};'
                      f'DATABASE={os.getenv("DB_DATABASE")};'
                      f'UID={os.getenv("DB_UID")};'
                      f'PWD={os.getenv("DB_PWD")}')

def get_embeddings(text):
    # Truncate the text to 8191 characters because of the text-embedding-3-small OpenAI embedding model's input limit
    truncated_text = text[:8191]

    response = embedding(
        model="azure/text-embedding-3-small",
        input=truncated_text,
        api_key=os.getenv('AZURE_API_KEY'),
        api_base=os.getenv('AZURE_API_BASE'),
        api_version=os.getenv('AZURE_API_VERSION')
        )
        
    embeddings = response['data'][0]['embedding']
    return embeddings


def update_database(article_id, title_vector, content_vector):
    cursor = conn.cursor()

    # Convert vectors to strings
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # Update the SQL query to use the string representations
    cursor.execute("""
        UPDATE newsvector
        SET title_vector = ?, content_vector = ?
        WHERE article_id = ?
    """, (title_vector_str, content_vector_str, article_id))
    conn.commit()


def embed_and_update():
    cursor = conn.cursor()
    cursor.execute("SELECT article_id, title, full_content FROM newsvector where title_vector is null and full_content is not null and title is not null order by published asc")
    
    title_vector = ""
    content_vector = ""
    
    for row in cursor.fetchall():
        article_id, title, full_content = row
        
        print(f"Embedding article {article_id} - {title}")
        
        title_vector = get_embeddings(title)
        content_vector = get_embeddings(full_content)
        
        update_database(article_id, title_vector, content_vector)

embed_and_update()

These two columns will contain something like this: [-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265…]
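Since those columns hold the vectors as JSON-array-style strings, they can be parsed straight back into Python lists when needed. A small sketch (the three sample values are illustrative, taken from the truncated example above):

```python
import json

# A vector column holds a JSON-array-style string of floats.
stored = "[-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265]"
vector = json.loads(stored)  # back to a Python list of floats

print(type(vector).__name__, len(vector))
```

This works because str() of a Python list of floats produces valid JSON, so the round trip through the database is lossless.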

Second, you must create a procedure in the Azure SQL database to transform the query into a vector embedding.

SQL
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[GET_EMBEDDINGS]
(
    @model VARCHAR(MAX),
    @text NVARCHAR(MAX),
    @embedding VARBINARY(8000) OUTPUT
)
AS
BEGIN
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @url VARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @text);

    -- Set the @url variable with proper concatenation before the EXEC statement
    SET @url = 'https://<Your App>.openai.azure.com/openai/deployments/' + @model + '/embeddings?api-version=2024-02-15-preview';

    EXEC dbo.sp_invoke_external_rest_endpoint 
        @url = @url,
        @method = 'POST',   
        @payload = @payload,   
        @headers = '{"Content-Type":"application/json", "api-key":"<Your Azure Open AI API Key"}', 
        @response = @response OUTPUT;

    -- Use JSON_QUERY to extract the embedding array directly
    DECLARE @jsonArray NVARCHAR(MAX) = JSON_QUERY(@response, '$.result.data[0].embedding');

    
    SET @embedding = JSON_ARRAY_TO_VECTOR(@jsonArray);
END

I also created another procedure to search the dataset directly using Native Vector Support in Azure SQL.

SQL
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[SearchNewsVector] 
    @inputText NVARCHAR(MAX)
AS
BEGIN
    -- Query the SimilarNewsContentArticles table using the response
    IF OBJECT_ID('dbo.result', 'U') IS NOT NULL
        DROP TABLE dbo.result;

	--Assuming you have a stored procedure to get embeddings for a given text
	DECLARE @e VARBINARY(8000);
	EXEC dbo.GET_EMBEDDINGS @model = 'text-embedding-3-small', @text = @inputText, @embedding = @e OUTPUT;

	SELECT TOP(10) 
       [article_id]
      ,[source_id]
      ,[source_name]
      ,[author]
      ,[title]
      ,[description]
      ,[url]
      ,[url_to_image]
      ,[content]
      ,[category]
      ,[full_content]
      ,[title_vector]
      ,[content_vector]
      ,[published]
      -- content_vector holds a JSON-array string, so convert it before measuring distance
      ,VECTOR_DISTANCE('cosine', @e, JSON_ARRAY_TO_VECTOR([content_vector])) AS cosine_distance
	into result
	FROM newsvector
	ORDER BY cosine_distance;
END

Finally, you can start querying your table using prompts instead of keywords. This is awesome!

Check out the app I developed with the Native Vector Support in Azure SQL, which is designed to assist you in crafting prompts and evaluating your performance using my newsvector dataset. To explore the app, click here.

Like always, I also created this GitHub repository with everything I did.

Signing Up for the Azure SQL Database Native Vector Support Private Preview

You can sign up for the private preview at this link.

This article by Davide Mauri and Pooja Kamath, published during the Microsoft Build 2024 event, provides all the details:

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

Conclusion

The integration of Azure OpenAI with native vector support in Azure SQL Database unlocks new possibilities for applications that require advanced search capabilities and data analysis. By storing and querying vector embeddings alongside traditional SQL data, you can build powerful solutions that combine the best of both worlds—semantic understanding with the reliability and performance of Azure SQL.

This innovation simplifies application development, enhances data insights, and paves the way for the next generation of intelligent applications.

That’s it for today!

Sources

Azure SQL DB Vector Functions Private Preview | Data Exposed (youtube.com)

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL

In the ever-evolving landscape of data management and retrieval, the ability to efficiently search through high-dimensional vector data has become a cornerstone for many modern applications, including recommendation systems, image recognition, and natural language processing tasks. Azure SQL Database (DB), in combination with KMeans clustering, is at the forefront of this revolution, offering an innovative solution that significantly enhances vector search capabilities.

What is KMeans?

KMeans is a widely used clustering algorithm in machine learning and data mining. It’s a method for partitioning an N-dimensional dataset into K distinct, non-overlapping clusters. Each cluster is defined by its centroid, which is the mean of the points in the cluster. The algorithm aims to minimize the variance within each cluster, effectively grouping the data points into clusters based on their similarity.

Let’s implement a code example to understand how it works. Import the NumPy module, then go through the rest of the code to see how K-Means clustering is implemented.

Python
#Loading the required modules
 
import numpy as np
from scipy.spatial.distance import cdist 
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
 
#Defining our function 
def kmeans(x,k, no_of_iterations):
    idx = np.random.choice(len(x), k, replace=False)
    #Randomly choosing Centroids 
    centroids = x[idx, :] #Step 1
     
    #finding the distance between centroids and all the data points
    distances = cdist(x, centroids ,'euclidean') #Step 2
     
    #Centroid with the minimum Distance
    points = np.array([np.argmin(i) for i in distances]) #Step 3
     
    #Repeating the above steps for a defined number of iterations
    #Step 4
    for _ in range(no_of_iterations): 
        centroids = []
        for idx in range(k):
            #Updating Centroids by taking mean of Cluster it belongs to
            temp_cent = x[points==idx].mean(axis=0) 
            centroids.append(temp_cent)
 
        centroids = np.vstack(centroids) #Updated Centroids 
         
        distances = cdist(x, centroids ,'euclidean')
        points = np.array([np.argmin(i) for i in distances])
         
    return points 
 
 
#Load Data
data = load_digits().data
pca = PCA(2)
  
#Transform the data
df = pca.fit_transform(data)
 
#Applying our function
label = kmeans(df,10,1000)
 
#Visualize the results
 
u_labels = np.unique(label)
for i in u_labels:
    plt.scatter(df[label == i , 0] , df[label == i , 1] , label = i)
plt.legend()
plt.show()

How does Voronoi Cell-based Vector Search Optimization work?

Vector Search Optimization via Voronoi Cells is an advanced technique to enhance the efficiency and accuracy of searching for similar vectors in a high-dimensional space. This method is particularly relevant in the context of Approximate Nearest Neighbor (ANN) searches, which aim to quickly find vectors close to a given query vector without exhaustively comparing the query vector against every other vector in the dataset.

Understanding Voronoi Cells

To grasp the concept of vector search optimization via Voronoi cells, it’s essential to understand what Voronoi diagrams are. A Voronoi diagram is a partitioning of a plane into regions based on the distance to points in a specific subset of the plane. Each region (Voronoi cell) is defined so that any point within it is closer to its corresponding “seed” point than to any other. These seed points are typically referred to as centroids in the context of vector search.

Application in Vector Search

In vector search, Voronoi cells can efficiently partition the high-dimensional space into distinct regions. Each cell represents a cluster of vectors that are closer to its centroid than to any other centroid. This approach is based on the assumption that vectors within the same Voronoi cell are more likely to be similar to each other than to vectors in different cells.

The Process

  1. Centroid Initialization: Like KMeans clustering, the process begins by selecting a set of initial centroids in the high-dimensional space.
  2. Voronoi Partitioning: The space is partitioned into Voronoi cells, each associated with one centroid. This partitioning is done such that every vector in the dataset is assigned to the cell of the closest centroid.
  3. Indexing and Search Optimization: Once the high-dimensional space is partitioned, an inverted file index or a similar data structure can be created to map each centroid to the list of vectors (or pointers to them) within its corresponding Voronoi cell. During a search query, instead of comparing the query vector against all vectors in the dataset, the search can be limited to vectors within the most relevant Voronoi cells, significantly reducing the search space and time.
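The three steps above can be sketched end to end in Python. This is a deliberately tiny, pure-Python stand-in for the real KMeans compute node (all function names and parameters here are my own):

```python
import math
import random

def dist(a, b):
    # Euclidean distance between two equal-length vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    # Step 1: centroid initialization from random data points
    centroids = random.sample(vectors, k)
    cells = [[] for _ in range(k)]
    for _ in range(iters):
        # Step 2: Voronoi partitioning - assign each vector to its closest centroid
        cells = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: dist(v, centroids[i]))
            cells[nearest].append(v)
        # Recompute each centroid as the mean of its cell (keep the old one if empty)
        centroids = [
            [sum(col) / len(cell) for col in zip(*cell)] if cell else centroids[i]
            for i, cell in enumerate(cells)
        ]
    return centroids, cells

def ann_search(query, centroids, cells, probes=1):
    # Step 3: restrict the search to the `probes` closest cells (the inverted index)
    order = sorted(range(len(centroids)), key=lambda i: dist(query, centroids[i]))
    candidates = [v for i in order[:probes] for v in cells[i]]
    return min(candidates, key=lambda v: dist(query, v))

random.seed(0)
data = [[random.random(), random.random()] for _ in range(200)]
centroids, cells = kmeans(data, k=4)
print(ann_search([0.5, 0.5], centroids, cells, probes=2))
```

With probes equal to the number of cells, the search degenerates to an exact scan; with fewer probes, it trades a little recall for much less work.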

Advantages

  • Efficiency: By reducing the search space to a few relevant Voronoi cells, the algorithm can achieve faster search times than brute-force searches.
  • Scalability: This method scales better with large datasets, as the overhead of partitioning the space and indexing is compensated by the speedup in query times.
  • Flexibility: The approach can be adapted to various data types and dimensionalities by adjusting the centroid selection and cell partitioning methods.

Introducing the Project of Azure SQL DB and KMeans Compute Node

Azure SQL DB has long been recognized for its robustness, scalability, and security as a cloud database service. By integrating KMeans clustering—a method used to partition data into k distinct clusters based on similarity—the capability of Azure SQL DB is expanded to include advanced vector search operations.

The KMeans Compute Node is a specialized component that handles the compute-intensive task of clustering high-dimensional data. This integration optimizes the performance of vector searches and simplifies the management and deployment of these solutions.

How It Works

  1. Data Storage: Vector data is stored in Azure SQL DB, leveraging its high availability and scalable storage solutions. This setup ensures that data management adheres to best practices regarding security and compliance.
  2. Vector Clustering: The KMeans Compute Node performs clustering on the vector data. This process groups vectors into clusters based on similarity, significantly reducing the search space for query operations.
  3. Search Optimization: Approximate Nearest Neighbor (ANN) searches can be executed more efficiently with vectors organized into clusters. Queries are directed towards relevant clusters rather than the entire dataset, enhancing search speed and accuracy.
  4. Seamless Integration: The entire process is streamlined through Azure Container Apps, which host the KMeans Compute Node. This setup provides a scalable, serverless environment that dynamically adjusts demand-based resources.

Advantages of the Azure SQL DB and KMeans Approach

  • Performance: By reducing the complexity of vector searches, response times are significantly improved, allowing for real-time search capabilities even in large datasets.
  • Scalability: The solution effortlessly scales with your data, thanks to Azure’s cloud infrastructure. This ensures that growing data volumes do not compromise search efficiency.
  • Cost-Effectiveness: Azure SQL DB offers a cost-efficient storage solution, while using Azure Container Apps for the KMeans Compute Node optimizes resource utilization, reducing overall expenses.
  • Simplicity: The integration simplifies the architecture of vector search systems, making it easier to deploy, manage, and maintain these solutions.

Use Cases and Applications

The Azure SQL DB and KMeans Compute Node solution is versatile, supporting a wide range of applications:

1-Analysis of Similarity in Court Decisions
  • Legal Research and Precedents: By analyzing the similarities in court decisions, legal professionals can efficiently find relevant precedents to support their cases. This application can significantly speed up legal research, ensuring lawyers can access comprehensive and pertinent case law that aligns closely with their current matters.
2-Personalized Medicine and Genomic Data Analysis
  • Drug Response Prediction: Leveraging vector search to analyze genomic data allows researchers to predict how patients might respond to different treatments. By clustering patients based on genetic markers, medical professionals can tailor treatments to individual genetic profiles, advancing the field of personalized medicine.
3-Market Trend Analysis
  • Consumer Behavior Clustering: Businesses can cluster consumer behavior data to identify emerging market trends and tailor their marketing strategies accordingly. Vector search can help analyze high-dimensional data, such as purchase history and online behavior, to segment consumers into groups with similar preferences and behaviors.
4-Cybersecurity Threat Detection
  • Anomaly Detection in Network Traffic: Vector search can monitor network traffic, identifying unusual patterns that may indicate cybersecurity threats. By clustering network events, it’s possible to quickly isolate and investigate anomalies, enhancing an organization’s ability to respond to potential security breaches.
5-Educational Content and Learning Style Personalization
  • Matching Educational Materials to Learning Styles: By clustering educational content and student profiles, educational platforms can personalize learning experiences. Vector search can identify the most suitable materials and teaching methods for different learning styles, improving student engagement and outcomes.
6-Environmental Monitoring and Conservation Efforts
  • Species Distribution Modeling: Vector search can analyze environmental data to model the distribution of various species across different habitats. This information is crucial for conservation planning, helping identify critical areas for biodiversity conservation.
7-Supply Chain Optimization
  • Predictive Maintenance and Inventory Management: In supply chain management, vector search can cluster equipment performance data to predict maintenance needs and optimize inventory levels. This application ensures that operations run smoothly, with minimal downtime and efficient use of resources.
8-Creative Industries and Content Creation
  • Similarity Analysis in Music and Art: Artists and creators can use vector search to analyze patterns and themes in music, art, and literature. This approach can help understand influences, trends, and the evolution of styles over time, providing valuable insights for new creations.

    Architecture of the project

    The project’s architecture is straightforward as it comprises a single container that exposes a REST API to build and rebuild the index and search for similar vectors. The container is deployed to Azure Container Apps and uses Azure SQL DB to store the vectors and the clusters.

    The idea is that compute-intensive operations, like calculating KMeans, can be offloaded to a dedicated container that is easy to deploy, quick to start, and offers serverless scaling for the best performance/cost ratio.

    Once the container runs, it is entirely independent of the database and can work without affecting database performance. Even better, if more scalability is needed, data can be partitioned across multiple container instances to achieve parallelism.

    Once the model has been trained, the identified clusters and centroids – and thus the IVF index – are saved back to the SQL DB so that they can be used to perform ANN search on the vector column without the need for the container to remain active. The container can be stopped entirely as SQL DB is completely autonomous now.

    The data is stored back in SQL DB using the following tables:

    • [$vector].[kmeans]: stores information about created indexes
    • [$vector].[<table_name>$<column_name>$clusters_centroids]: stores the centroids
    • [$vector].[<table_name>$<column_name>$clusters]: the IVF structure, associating each centroid to the list of vectors assigned to it

    To make searching even more accessible, a function is also created:

    • [$vector].[find_similar$<table_name>$<column_name>](<vector>, <probe cells count>, <similarity threshold>): the function to perform ANN search

    The function calculates the dot product, which is equivalent to the cosine similarity when the vectors are normalized to unit length.
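    That equivalence is easy to verify in a few lines of Python (a standalone illustration, not part of the project):

```python
import math

def normalize(v):
    # Scale a vector to unit length
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

a, b = [3.0, 4.0], [1.0, 2.0]
na, nb = normalize(a), normalize(b)
# On unit vectors, the dot product equals the cosine similarity of the originals.
print(dot(na, nb), cosine_similarity(a, b))
```

    This is why embedding pipelines often normalize vectors up front: the cheaper dot product can then stand in for cosine similarity.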

    Also, the function:

    • [$vector].[find_cluster$<table_name>$<column_name>](<vector>): find the cluster of a given vector

    is provided as it is needed to insert new vectors into the IVF index.

    Implementation

    The project is divided into two GitHub repositories: one with Python source code for the KMeans compute node created by Davide Mauri, Principal Product Manager in Azure SQL DB at Microsoft, and the other part with the actual example app I have created to test the project.

    1. Azure SQL DB Vector – KMeans Compute Node

    The KMeans model from scikit-learn is executed within a container as a REST endpoint. The APIs exposed by the container are:

    • Server Status: GET /
    • Build Index: POST /kmeans/build
    • Rebuild Index: POST /kmeans/rebuild

    Both the Build and Rebuild APIs are asynchronous. The Server Status API can be used to check the status of the build process.
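    Since the calls are asynchronous, a client would typically poll the Server Status API until the build leaves the building state. A hedged Python sketch (fetch_status stands in for an HTTP GET against the container, and treating idle as the finished state is my assumption):

```python
import time

def wait_for_index(fetch_status, poll_seconds=1, timeout=60):
    """Poll the server-status endpoint (GET /) until the build finishes.

    fetch_status is any callable returning the parsed status JSON;
    in practice it would wrap an HTTP GET against the container.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = fetch_status()["server"]["status"]["current"]
        if status not in ("initializing", "building"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("index build did not finish in time")

# Stubbed responses simulating a build that completes on the third poll.
responses = iter([
    {"server": {"status": {"current": "initializing", "last": "idle"}}},
    {"server": {"status": {"current": "building", "last": "initializing"}}},
    {"server": {"status": {"current": "idle", "last": "building"}}},
])
print(wait_for_index(lambda: next(responses), poll_seconds=0))
```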

    Build Index

    To build an index from scratch, the Build API expects the following payload:

    JSON
    {
      "table": {
        "schema": <schema name>,
        "name": <table name>
      },
      "column": {
        "id": <id column name>,
        "vector": <vector column name>
      },
      "vector": {
        "dimensions": <dimensions>
      }
    }

    Using the aforementioned news dataset, the payload would be:

    JSON
    POST /kmeans/build
    {
        "table": {
            "schema": "dbo",
            "name": "news"
        },
        "column": {
            "id": "article_id",
            "vector": "content_vector"
        },
        "vector": {
            "dimensions": 1536
        }
    }

    The API would verify that the request is correct and then start the build process asynchronously, returning the ID assigned to the index being created:

    JSON
    {
      "id": 1,
      "status": {
        "status": {
          "current": "initializing",
          "last": "idle"
        },
        "index_id": "1"
      }
    }

    The API will return an error if an index on the same table and vector column already exists. If you want to force the creation of a new index over the existing one, you can use the force option:

    POST /kmeans/build?force=true

    Rebuild Index

    If you need to rebuild an existing index, you can use the Rebuild API. The API doesn’t need a payload, as it will use the existing index definition. Just like the build process, the rebuild process is also asynchronous. The index to be rebuilt is specified via the URL path:

    POST /kmeans/rebuild/<index id>

    For example, to rebuild the index with id 1:

    POST /kmeans/rebuild/1

    Query API Status

    The status of the build process can be checked using the Server Status API:

    GET /

    And you’ll get the current status and the last status report:

    JSON
    {
      "server": {
        "status": {
          "current": "building",
          "last": "initializing"
        },
        "index_id": 1
      },
      "version": "0.0.1"
    }

    Checking the previous status is helpful to understand whether an error occurred during the build process.

    You can also check the index build status by querying the [$vector].[kmeans] table.
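Since the build runs asynchronously, a client typically polls the Server Status API until the build completes. Here is a small helper, sketched in Python, for reading the status document returned by `GET /` (the JSON shape follows the example above; the HTTP call itself is omitted so the helper stands alone):

```python
def parse_status(status_doc):
    """Extract the current and previous build state from the Server Status API response."""
    status = status_doc["server"]["status"]
    return status["current"], status["last"]

# Example response shape, as returned while an index build is in progress.
doc = {
    "server": {"status": {"current": "building", "last": "initializing"}, "index_id": 1},
    "version": "0.0.1",
}

current, last = parse_status(doc)
print(current, last)  # building initializing
```

Checking `last` as well as `current` is what lets you spot a build that failed and fell back to an earlier state.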

    2. Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL

    Search for similar vectors

    Once you have built the index, you can search for similar vectors. Using the sample dataset, you can search for the 10 news articles most similar to ‘How Generative AI Is Transforming Today’s And Tomorrow’s Software Development Life Cycle’ using the find_similar function created as part of the index build process. For example:

    SQL
    /*
        This SQL code is used to search for similar news articles based on a given input using vector embeddings.
        It makes use of an external REST endpoint to retrieve the embeddings for the input text.
        The code then calls the 'find_similar$news$content_vector' function to find the top 10 similar news articles.
        The similarity is calculated based on the dot product of the embeddings.
        The result is ordered by the dot product in descending order.
    */
    
    declare @response nvarchar(max);
    declare @payload nvarchar(max) = json_object('input': 'How Generative AI Is Transforming Today’s And Tomorrow’s Software Development Life Cycle.');
    
    exec sp_invoke_external_rest_endpoint
        @url = 'https://<YOUR APP>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview',
        @credential = [https://<YOUR APP>.openai.azure.com],
        @payload = @payload,
        @response = @response output;
    
    select top 10 r.published, r.category, r.author, r.title, r.content, r.dot_product
    from [$vector].find_similar$news$content_vector(json_query(@response, '$.result.data[0].embedding'), 50, 0.80)  AS r
    order by dot_product desc

    The find_similar function takes three parameters:

    • the vector to search for
    • the number of clusters to search in
    • the similarity threshold

    The similarity threshold filters out vectors that are not similar enough to the query vector; the higher the threshold, the more similar the returned vectors will be. The number of clusters to search in controls the trade-off between speed and accuracy: searching more clusters improves the chance of finding the truly closest vectors, while searching fewer clusters makes the query faster at the risk of missing some good matches.
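To make the threshold's effect concrete, here is a toy post-filtering step in Python. The candidate IDs and scores are made up for illustration; in practice this filtering happens inside the find_similar function.

```python
def filter_candidates(candidates, threshold, k):
    """Keep candidates at or above the similarity threshold, best first, at most k."""
    kept = [(cid, score) for cid, score in candidates if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

# Hypothetical (article_id, dot_product) pairs returned from the searched clusters.
candidates = [(101, 0.91), (102, 0.78), (103, 0.85), (104, 0.62)]
print(filter_candidates(candidates, 0.80, 10))  # [(101, 0.91), (103, 0.85)]
```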

    Explore the latest app I’ve created, which is tailored to help you craft prompts and assess the results using my updated news dataset. Click here to start discovering the app’s features. In my app, I search all 50 clusters and use a similarity threshold of 80%.

    It’s important to understand that you can also search for several articles simultaneously and get similar results for each.

    This post connects to another one where I discuss ‘Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide on Using Generative AI to Prompt Queries in Datasets’, which takes the same approach but uses cosine similarity instead of the dot product.

    Conclusion

    Integrating Azure SQL DB with KMeans Compute Node represents a significant advancement in vector search, providing an efficient, scalable, and cost-effective solution. This innovative approach to managing and querying high-dimensional data stands as a beacon for businesses wrestling with the intricacies of big data. By leveraging such cutting-edge technologies, organizations are better positioned to unlock their data’s full potential, uncovering insights previously obscured by the sheer volume and complexity of the information. This, in turn, allows for delivering superior services and products more closely aligned with user needs and preferences.

    Moreover, adopting Azure’s robust infrastructure and the strategic application of the KMeans clustering algorithm underscores a broader shift towards more intelligent, data-driven decision-making processes. As companies strive to remain competitive in an increasingly data-centric world, the ability to swiftly and accurately sift through vast datasets to find relevant information becomes paramount. Azure SQL DB and KMeans Compute Node facilitate this, enabling businesses to improve operational efficiencies, innovate, and personalize their offerings, enhancing customer satisfaction and engagement.

    Looking ahead, the convergence of Azure SQL DB and KMeans Compute Node is setting the stage for a new era in data management and retrieval. As this technology continues to evolve and mature, it promises to open up even more possibilities for deep analytical insights and real-time data interaction. This revolution in vector search is not just about managing data more effectively; it’s about reimagining what’s possible with big data, paving the way for future innovations that will continue to transform industries and redefine user experiences. With Azure at the forefront, the future of data management is bright, marked by an era of unparalleled efficiency, scalability, and insight.

    That’s it for today!

    Sources

    Azure-Samples/azure-sql-db-vectors-kmeans: Use KMeans clustering to speed up vector search in Azure SQL DB (github.com)

    K-Means Clustering From Scratch in Python [Algorithm Explained] – AskPython

    Interactive Data Analysis: Chat with Your Data in Azure SQL Database Using Vanna AI

    In an era where data is the new gold, the ability to effectively mine, understand, and utilize this valuable resource determines the success of businesses. Traditional data analysis methods often create a bottleneck due to their complexity and the need for specialized skills. This is where the groundbreaking integration of Vanna AI with Azure SQL Database heralds a new dawn. Inspired by the pivotal study “AI SQL Accuracy: Testing different LLMs + context strategies to maximize SQL generation accuracy,” this article explores how Vanna AI is not just an innovation but a revolution in data analytics. It simplifies complex data queries into conversational language, making data analysis accessible to all, irrespective of their technical prowess.

    Understanding Vanna AI: The Next Frontier in Data Analytics

    Vanna AI emerges as a pivotal innovation in the rapidly evolving landscape of artificial intelligence and data management. But what exactly is Vanna AI, and why is it becoming a game-changer in data analytics? Let’s delve into the essence of Vanna AI and its transformative impact.

    What is Vanna AI?

    Vanna AI is an advanced AI-driven tool designed to bridge the gap between complex data analysis and user-friendly interaction. At its core, Vanna AI is a sophisticated application of Large Language Models (LLMs) optimized for interacting with databases. It leverages the power of AI to translate natural language queries into precise SQL commands, effectively allowing users to “converse” with their databases.

    Key Features and Capabilities

    1. Natural Language Processing (NLP): Vanna AI excels at understanding and processing human language, enabling users to ask questions in plain English and receive accurate data insights.
    2. Contextual Awareness: One of the standout features of Vanna AI is its ability to understand a specific database’s structure and nuances contextually. This includes schema definitions, documentation, and historical queries, significantly enhancing the accuracy of SQL generation.
    3. Adaptability Across Databases: Vanna AI is not limited to a single type of database. Its versatility allows it to be integrated with various database platforms, including Azure SQL Database, enhancing its applicability across different business environments.
    4. Ease of Use: By simplifying the process of data querying, Vanna AI democratizes data analysis, making it accessible to non-technical users, such as business analysts, marketing professionals, and decision-makers.

    How Vanna works

    Vanna works in two easy steps – train a RAG “model” on your data and then ask questions that will return SQL queries that can be set up to run on your database automatically.

    1. vn.train(...): Train a RAG “model” on your data. These methods add to the reference corpus below.
    2. vn.ask(...): Ask questions. This will use the reference corpus to generate SQL queries that can be run on your database.
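The two-step flow can be illustrated with a toy in-memory sketch. This is only a conceptual illustration of the train/ask pattern: it is not the real Vanna API, which uses an LLM plus a vector store for retrieval rather than simple word overlap, and the class, method, and query names below are invented for the example.

```python
class ToyRag:
    """Minimal train/ask pattern: store reference pairs, retrieve by word overlap."""

    def __init__(self):
        self.corpus = []  # (question, sql) reference pairs

    def train(self, question, sql):
        self.corpus.append((question, sql))

    def ask(self, question):
        words = set(question.lower().split())
        # Pick the stored example whose question shares the most words with ours.
        best = max(self.corpus, key=lambda pair: len(words & set(pair[0].lower().split())))
        return best[1]

vn = ToyRag()
vn.train("total sales by year",
         "SELECT YEAR(OrderDate), SUM(TotalDue) FROM SalesOrderHeader GROUP BY YEAR(OrderDate)")
vn.train("list all customers", "SELECT * FROM Customer")
print(vn.ask("show sales by year"))
```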

    Empowering SQL Generation with AI

    The challenge in traditional data analysis has been the necessity of SQL expertise. Vanna AI disrupts this norm by enabling users to frame queries in plain language and translate them into SQL. This approach democratizes data access and accelerates decision-making by providing quicker insights.

    The research compared the efficacy of various Large Language Models (LLMs), such as Google Bison, GPT-3.5, GPT-4 Turbo, and Llama 2, in generating SQL. While GPT-4 delivered the best overall performance, the study highlighted that other LLMs could achieve comparable accuracy with the proper context.

    Presenting the Practical Application I Developed for Your Evaluation.

    As a testament to Vanna AI’s practical application, I created an example app that you can test yourself to understand how it works: an application designed for the Microsoft Adventure Works database, available at this URL. This application exemplifies how AI can transform data interaction. It allows users to converse with the Adventure Works database in natural language, simplifying complex data queries and making data analysis more approachable and efficient.

    Exploring the AdventureWorksLT Schema: An Overview of Database Relationships and Structure

    Here is a concise introduction to the Adventure Works database. This will help you better understand the database structure and tables, enabling you to make more effective queries in the test application I developed.

    In the Dbo schema, there is an ErrorLog table designed to capture error information, with fields such as ErrorTime, UserName, and ErrorMessage. The CustomerAddress table bridges customers to addresses, suggesting a many-to-many relationship as one customer can have multiple addresses, and one address can be associated with multiple customers.

    The SalesLT schema is more complex and includes several interconnected tables:

    • Product: Contains product details, such as name, product number, color, and size.
    • ProductCategory: Organizes products into hierarchical categories.
    • ProductModel: Defines models for products, which could include multiple products under a single model.
    • ProductModelProductDescription: Links product models to their descriptions, indicating a many-to-many relationship between models and descriptions, facilitated by a culture identifier.
    • ProductDescription: Stores descriptions for products in different languages (indicated by the Culture field).
    • Address: Holds address information and is related to customers through the CustomerAddress table.
    • Customer: Holds customer information such as name, contact details, and password hashes for customer accounts.
    • SalesOrderHeader: Captures the header information of sales orders, including details like order date, due date, and total due amount.
    • SalesOrderDetail: Provides line item details for each sales order, such as quantity and price.

    The schema includes primary keys (PK) to uniquely identify each entry in a table, foreign keys (FK) to establish relationships between tables, and indexes (U1, U2) to improve query performance on the database.

    Explore the Source Code of the App I Developed.

    To develop an app yourself using Azure SQL Database, click this link to access my GitHub repository containing all the source code.

    Conclusion

    As we stand at the cusp of a data revolution, Vanna AI’s integration with Azure SQL Database and its practical embodiment in applications like the app I created, for example, for the Microsoft Adventure Works database, represents more than technological advancement; they signify a paradigm shift in data interaction and analysis. This evolution marks the transition from data being experts’ exclusive domain to becoming a universal language understood and utilized across various business sectors. The journey of data analytics, powered by AI and made user-friendly through Vanna AI, is not just about technological transformation; it’s about empowering organizations and individuals with the tools to unlock the true potential of their data. Stay connected with the evolving world of Vanna AI and discover how this revolutionary tool can redefine your approach to data, paving the way for a more informed, efficient, and data-driven future.

    That’s it for today!

    Vanna.AI – Personalized AI SQL Agent

    Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide on How to Use Generative AI to Prompt Queries in Datasets

    The evolving landscape of data analytics has brought vector databases to the forefront, especially with their application in finding similarities in diverse data types such as articles, photos, and products. Azure SQL, combined with the prowess of OpenAI, offers a powerful platform for executing vector operations, simplifying the task of finding similar items and enhancing recommendation systems in applications.

    What Is Vector Similarity, and How Do You Calculate Cosine Similarity?

    Vector similarity revolves around transforming data into numerical vectors or embeddings. These embeddings are numerical representations of various concepts converted into sequences of numbers, making it easier for computers to grasp their relationships. This method is particularly effective in comparing and finding similarities between data points, a crucial feature in applications like search engines and clustering algorithms.

    Cosine similarity, a commonly used metric in vector similarity, measures the cosine of the angle between two vectors. This metric is crucial in determining the degree of similarity between the vectors, irrespective of their size. In Azure SQL, cosine similarity can be computed with a simple SQL formula involving SUM and SQRT functions applied to the vector elements, thus providing a straightforward yet powerful way to measure vector similarity.

    Cosine similarity can be calculated in SQL using the following formula, given two vectors, a and b, stored element by element (for illustration, in tables vector_a and vector_b with one row per element, joined on the element index):

    SELECT 
        SUM(a.value * b.value) / (  
            SQRT(SUM(a.value * a.value)) * SQRT(SUM(b.value * b.value))   
        ) AS cosine_similarity
    FROM
        vector_a AS a
    INNER JOIN
        vector_b AS b ON a.vector_value_id = b.vector_value_id
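For intuition, here is the same computation in plain Python (standard library only):

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|)"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```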
    
    

    How to Query Azure OpenAI

    Azure SQL’s integration with Azure OpenAI simplifies generating and working with embeddings. Users can obtain the vector representation of any given text by creating an Azure OpenAI resource and deploying a model like text-embedding-ada-002. This integration enables the execution of REST API calls from within the Azure SQL Database, making fetching and working with embeddings more streamlined and accessible.

    SQL
    -- Declare a variable to hold the response from the external REST API
    declare @response nvarchar(max);
    
    -- Declare and initialize a variable with JSON payload. 
    -- The JSON object contains an 'input' key with a text value.
    declare @payload nvarchar(max) = json_object('input': 'This is how to futureproof your career in an AI world');
    
    -- Execute a stored procedure to invoke an external REST endpoint
    exec sp_invoke_external_rest_endpoint
        @url = 'https://<your-app-name>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview', -- The URL of the REST endpoint
        @credential = [https://<your-app-name>.openai.azure.com], -- Credential for accessing the REST API
        @payload = @payload, -- The JSON payload defined earlier
        @response = @response output; -- Output parameter to store the response from the API

    Advantages of Implementing Vectors in Azure SQL Database

    1. Ease of Access with Azure OpenAI: Azure SQL combined with Azure OpenAI offers easy access to REST services for generating embeddings using pre-trained machine learning models. This accessibility facilitates the calculation of embeddings, which is otherwise a complex task.

    2. Efficient Storage with Columnstore Indexes: Azure SQL databases efficiently store vectors using columnstore indexes. This method is particularly beneficial since Azure SQL doesn’t have a specific vector data type. Vectors, essentially lists of numbers, can be conveniently stored in a table with one row per vector element.

    3. Fast Distance Calculation: The internal optimization of columnstore indexes in Azure SQL, employing SIMD and AVX-512 instructions, allows for high-speed calculation of distances between vectors, which is crucial for determining similarity.

    4. Integration with Azure AI Search: Azure SQL’s integration with Azure AI Search streamlines the entire process of chunking, generating, storing, and querying vectors for vector search, significantly speeding up the development of the vectorization pipeline and minimizing maintenance tasks.

    5. Capability for Complex Operations: Azure SQL enables complex operations like indexing, storing, and retrieving vector embeddings from a search index, which is essential for identifying the most similar documents in a vector space.

    6. Versatile Data Handling: Azure SQL’s ability to handle structured and unstructured data, along with vector data, provides more versatility compared to vector databases, which are primarily optimized for vector data storage and retrieval.

    7. Ease of Querying and Relevance Determination: The integration with Azure OpenAI allows easy querying of the REST service to obtain vector representations of text, which can then be used to calculate similarity against stored vectors, identifying the most relevant data.

    8. Simplified Deployment and Management: Deploying and managing an embedding model via the Azure portal is straightforward, reducing the complexity of managing vector databases.

    9. Suitability for a Range of Applications: While vector databases are specialized for high-dimensional similarity searches, Azure SQL’s broader application scope makes it suitable for various types of data and applications, from financial records to customer data.

    10. Support for Advanced Azure Features: Azure SQL supports advanced Azure features, such as AI and machine learning capabilities, which can be seamlessly integrated with vector similarity operations for enhanced analytics and insights.

    These advantages highlight the flexibility, efficiency, and ease of use of Azure SQL databases for vector similarity operations, making them a preferable choice in scenarios where diverse data types and complex operations are involved, alongside the need for seamless integration with other Azure services.

    Implementing Vectors in Azure SQL Database

    Azure SQL Database provides a seamless way to store and manage vector data despite not having a specific vector data type. Using columnstore indexes, vectors (essentially lists of numbers) can be efficiently stored in a table, with each vector represented as one row per element or as a serialized array. This approach ensures efficient storage and retrieval, making Azure SQL suitable for large-scale vector data management.

    I used the Global News Dataset from Kaggle in my experiment.

    First, you must create the columns to hold the vector information. In my case, I created two columns: title_vector for the news title and content_vector for the news content. For this, I wrote a small Python script, but you can also do it directly from SQL using a cursor. It's important to know that by saving the vector information inside Azure SQL, you don't need to pay for a separate vector database.

    Python
    import openai
    import pyodbc  # or another SQL connection library
    
    # Set up OpenAI credentials
    openai.api_type = "azure"
    openai.api_key = "<YOUR AZURE OPEN AI KEY>"
    openai.api_base = "https://<your-app-name>.openai.azure.com/"
    openai.api_version = "2023-07-01-preview"
    
    # Connect to your Azure SQL database
    conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                          'SERVER=<SERVER>;DATABASE=<DATABASE>;'
                          'UID=<USER>;PWD=<PASSWORD>')
    
    def get_embeddings(text):
        # Truncate the text to 8000 characters
        truncated_text = text[:8000]
    
        response = openai.Embedding.create(input=truncated_text, engine="embeddings")
        embeddings = response['data'][0]['embedding']
        return embeddings
    
    def update_database(article_id, title_vector, content_vector):
        cursor = conn.cursor()
    
        # Convert vectors to strings
        title_vector_str = str(title_vector)
        content_vector_str = str(content_vector)
    
        # Update the SQL query to use the string representations
        cursor.execute("""
            UPDATE news
            SET title_vector = ?, content_vector = ?
            WHERE article_id = ?
        """, (title_vector_str, content_vector_str, article_id))
        conn.commit()
    
    def embed_and_update():
        cursor = conn.cursor()
        cursor.execute("SELECT article_id, title, full_content FROM news WHERE title_vector IS NULL AND full_content IS NOT NULL AND title IS NOT NULL ORDER BY published DESC")

        for row in cursor.fetchall():
            article_id, title, full_content = row

            print(f"Embedding article {article_id} - {title}")

            title_vector = get_embeddings(title)
            content_vector = get_embeddings(full_content)

            update_database(article_id, title_vector, content_vector)

    embed_and_update()

    These two columns will contain something like this: [-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265…]
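On the client side, that stored string can be parsed back into a list of floats with `json.loads` (the value below is a shortened three-element example, not a full 1536-dimension embedding):

```python
import json

# A shortened example of the JSON-array text stored in the vector columns.
stored = "[-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265]"

vector = json.loads(stored)
print(len(vector), vector[0])  # 3 -0.02232750505208969
```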

    Second, you must extract these vectors, one element per row, into a new table for each vector field. For this, I created a stored procedure.

    SQL
    -- =============================================
    -- Author:      Lawrence Teixeira
    -- Create Date: 11-24-2023
    -- Description: This stored procedure creates vectors for news titles and contents.
    -- It processes data from the 'news' table and stores the vectors in separate tables.
    -- =============================================
    
    CREATE PROCEDURE [dbo].[Create_News_Vector]
    AS
    BEGIN
        SET NOCOUNT ON; -- Prevents the sending of DONE_IN_PROC messages to the client.
    
        -- Extract and store title vectors:
        -- First, check if the 'news_titles_vector' table exists and drop it if it does.
        IF OBJECT_ID('dbo.news_titles_vector', 'U') IS NOT NULL
            DROP TABLE dbo.news_titles_vector;
    
        -- Using a Common Table Expression (CTE) to process title vectors.
        WITH cte AS
        (
            SELECT 
                v.article_id,    
                CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
                CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
            FROM 
                dbo.news AS v 
            CROSS APPLY 
                OPENJSON(v.title_vector) tv -- Parses JSON of title_vector in the 'news' table.
        )
        -- Create 'news_titles_vector' table with processed vectors.
        SELECT
            article_id,
            vector_value_id,
            vector_value
        INTO
            dbo.news_titles_vector
        FROM
            cte;
    
        -- Extract and store content vectors:
        -- Check and drop 'news_contents_vector' table if it exists.
        IF OBJECT_ID('dbo.news_contents_vector', 'U') IS NOT NULL
            DROP TABLE dbo.news_contents_vector;
    
        -- CTE for processing content vectors.
        WITH cte AS
        (
            SELECT 
                v.article_id,    
                CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
                CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
            FROM 
                dbo.news AS v 
            CROSS APPLY 
                OPENJSON(v.content_vector) tv -- Parses JSON of content_vector in the 'news' table.
        )
        -- Create 'news_contents_vector' table with processed vectors.
        SELECT
            article_id,
            vector_value_id,
            vector_value
        INTO
            dbo.news_contents_vector
        FROM
            cte;
    
        -- Create columnstore indexes on the vector tables to optimize
        -- data storage and query performance.
        CREATE CLUSTERED COLUMNSTORE INDEX cci_news_titles_vector
        ON dbo.news_titles_vector
        ORDER (article_id);

        CREATE CLUSTERED COLUMNSTORE INDEX cci_news_contents_vector
        ON dbo.news_contents_vector
        ORDER (article_id);
    END

    I also created a function to search the dataset directly using the cosine similarity formula.

    SQL
    -- This Azure SQL function finds news articles similar to the given content vector.
    CREATE function [dbo].[SimilarNewsContentArticles](
        @vector nvarchar(max) -- Input parameter: JSON string representing a content vector.
    )
    returns table -- The function returns a table.
    as
    return with 
    
    -- CTE for processing the input vector.
    cteVector as
    (
        select 
            cast([key] as int) as [vector_value_id], -- Extracts and casts the 'key' from JSON to int.
            cast([value] as float) as [vector_value] -- Extracts and casts the 'value' from JSON to float.
        from 
            openjson(@vector) -- Parses the input JSON vector.
    ),
    
    -- CTE for calculating similarity scores with existing articles.
    cteSimilar as
    (
        select top (50)
            v2.article_id, 
            sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance 
            -- Dot product of the vectors; for normalized embeddings this equals cosine similarity.
        from 
            cteVector v1 -- Uses the processed input vector.
        inner join 
            dbo.news_contents_vector v2 
            on v1.vector_value_id = v2.vector_value_id -- Joins with stored article vectors.
        group by
            v2.article_id
        order by
            cosine_distance desc -- Orders by similarity score, descending.
    )
    
    -- Final selection combining article data with similarity scores.
    select 
        a.*, -- Selects all columns from the news article.
        r.cosine_distance -- Includes the calculated similarity score.
    from 
        cteSimilar r -- Uses the similarity scores CTE.
    inner join 
        dbo.news a on r.article_id = a.article_id -- Joins with the news articles table.
    GO
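A note on the cosine_distance column above: it is computed as a plain dot product rather than the full cosine formula. That is fine here, because OpenAI's text-embedding-ada-002 vectors are normalized to unit length, so the dot product of two embeddings equals their cosine similarity. A quick numeric check in Python:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors, the dot product equals the full cosine formula.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
print(round(dot(a, b), 6), round(cosine, 6))  # 0.96 0.96
```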

    Finally, you can start querying your table using prompts instead of keywords. This is awesome!

    Check out the app I developed, which is designed to assist you in crafting prompts and evaluating your performance using my news dataset. To explore the app, click here.

    I also created this GitHub repository with everything I did.


    Conclusion

    While the sample in this blog is not optimized for maximum efficiency, it is an excellent starting point for understanding and implementing vector operations in Azure SQL. The process, despite its simplicity, is relatively fast. For instance, querying an eight vCore Azure SQL database can return the fifty most similar articles in just half a second, demonstrating the efficiency and utility of vector operations in Azure SQL Database for data analytics and insights. If you want to know more about this topic, don’t hesitate to get in touch with me.

    That’s it for today!

    Sources:

    Vector Similarity Search with Azure SQL database and OpenAI – Azure SQL Devs’ Corner (microsoft.com)

    Vector Similarity Search with Azure SQL database and OpenAI | by Davide Mauri | Microsoft Azure | Medium

    Azure-Samples/azure-sql-db-openai: Samples on how to use Azure SQL database with Azure OpenAI (github.com)