Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide to Using Generative AI to Prompt Queries in Datasets

The evolving landscape of data analytics has brought vector databases to the forefront, especially with their application in finding similarities in diverse data types such as articles, photos, and products. Azure SQL, combined with the prowess of OpenAI, offers a powerful platform for executing vector operations, simplifying the task of finding similar items and enhancing recommendation systems in applications.

What is vector similarity, and how do you calculate cosine similarity?

Vector similarity revolves around transforming data into numerical vectors or embeddings. These embeddings are numerical representations of various concepts converted into sequences of numbers, making it easier for computers to grasp their relationships. This method is particularly effective in comparing and finding similarities between data points, a crucial feature in applications like search engines and clustering algorithms.

Cosine similarity, a commonly used metric in vector similarity, measures the cosine of the angle between two vectors: cosine_similarity(a, b) = (a · b) / (‖a‖ × ‖b‖), the dot product divided by the product of the vectors’ magnitudes. This metric determines the degree of similarity between the vectors irrespective of their size. In Azure SQL, cosine similarity can be computed with a simple SQL formula involving SUM and SQRT functions applied to the vector elements, providing a straightforward yet powerful way to measure vector similarity.

Cosine similarity can be calculated in SQL using the following formula, given two vectors, a and b, each stored one row per element with the element's position and value (here, vector_a and vector_b stand for any two such element-wise tables):

SQL
SELECT 
    SUM(a.vector_value * b.vector_value) / (  
        SQRT(SUM(a.vector_value * a.vector_value)) * SQRT(SUM(b.vector_value * b.vector_value))   
    ) AS cosine_similarity
FROM
    vector_a AS a
INNER JOIN
    vector_b AS b ON a.vector_value_id = b.vector_value_id;
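
For intuition, here is the same computation outside the database: a minimal Python sketch using NumPy (the sample vectors are made up).

Python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product of the vectors divided by the product of their magnitudes.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy three-dimensional "embeddings", just to illustrate the formula.
a = np.array([0.1, 0.3, -0.2])
b = np.array([0.2, 0.1, -0.1])
print(cosine_similarity(a, b))  # 1.0 = same direction, 0 = unrelated, -1.0 = opposite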

How to Query Azure OpenAI?

Azure SQL’s integration with Azure OpenAI simplifies generating and working with embeddings. Users can obtain the vector representation of any given text by creating an Azure OpenAI resource and deploying a model like text-embedding-ada-002. This integration enables the execution of REST API calls from within the Azure SQL Database, making fetching and working with embeddings more streamlined and accessible.

SQL
-- Declare a variable to hold the response from the external REST API
declare @response nvarchar(max);

-- Declare and initialize a variable with JSON payload. 
-- The JSON object contains an 'input' key with a text value.
declare @payload nvarchar(max) = json_object('input': 'This is how to futureproof your career in an AI world');

-- Execute a stored procedure to invoke an external REST endpoint
exec sp_invoke_external_rest_endpoint
    @url = 'https://<your-app-name>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview', -- The URL of the REST endpoint
    @credential = [https://<your-app-name>.openai.azure.com], -- Database scoped credential (created beforehand) for authenticating to the REST API
    @payload = @payload, -- The JSON payload defined earlier
    @response = @response output; -- Output parameter to store the response from the API
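
For comparison, here is roughly the same REST call made from Python with the requests library. This is a sketch that assumes the same deployment name and API version as the T-SQL example above.

Python
import requests

# Endpoint details mirror the T-SQL example above.
url = ("https://<your-app-name>.openai.azure.com/openai/deployments/"
       "embeddings/embeddings?api-version=2023-03-15-preview")
headers = {"api-key": "<YOUR AZURE OPEN AI KEY>", "Content-Type": "application/json"}
payload = {"input": "This is how to futureproof your career in an AI world"}

response = requests.post(url, headers=headers, json=payload)
response.raise_for_status()

# The embedding comes back as a list of floats
# (1,536 dimensions for text-embedding-ada-002).
embedding = response.json()["data"][0]["embedding"]
print(len(embedding), embedding[:3])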

Advantages of Implementing Vectors in Azure SQL Database

1. Ease of Access with Azure OpenAI: Azure SQL combined with Azure OpenAI offers easy access to REST services for generating embeddings using pre-trained machine learning models. This accessibility facilitates the calculation of embeddings, which is otherwise a complex task.

2. Efficient Storage with Columnstore Indexes: Azure SQL databases efficiently store vectors using columnstore indexes. This method is particularly beneficial since Azure SQL doesn’t have a specific vector data type. Vectors, essentially lists of numbers, can be conveniently stored in a table with one row per vector element.

3. Fast Distance Calculation: The internal optimization of columnstore indexes in Azure SQL, employing SIMD and AVX-512 instructions, allows for high-speed calculation of distances between vectors, which is crucial for determining similarity.

4. Integration with Azure AI Search: Azure SQL’s integration with Azure AI Search streamlines the entire process of chunking, generating, storing, and querying vectors for vector search, significantly speeding up the development of the vectorization pipeline and minimizing maintenance tasks.

5. Capability for Complex Operations: Azure SQL enables complex operations like indexing, storing, and retrieving vector embeddings from a search index, which is essential for identifying the most similar documents in a vector space.

6. Versatile Data Handling: Azure SQL’s ability to handle structured and unstructured data, along with vector data, provides more versatility compared to vector databases, which are primarily optimized for vector data storage and retrieval.

7. Ease of Querying and Relevance Determination: The integration with Azure OpenAI allows easy querying of the REST service to obtain vector representations of text, which can then be used to calculate similarity against stored vectors, identifying the most relevant data.

8. Simplified Deployment and Management: Deploying and managing an embedding model via the Azure portal is straightforward, reducing the complexity of managing vector databases.

9. Suitability for a Range of Applications: While vector databases are specialized for high-dimensional similarity searches, Azure SQL’s broader application scope makes it suitable for various types of data and applications, from financial records to customer data.

10. Support for Advanced Azure Features: Azure SQL supports advanced Azure features, such as AI and machine learning capabilities, which can be seamlessly integrated with vector similarity operations for enhanced analytics and insights.

These advantages highlight the flexibility, efficiency, and ease of use of Azure SQL databases for vector similarity operations, making them a preferable choice in scenarios where diverse data types and complex operations are involved, alongside the need for seamless integration with other Azure services.

Implementing Vectors in Azure SQL Database

Azure SQL Database provides a seamless way to store and manage vector data despite not having a specific vector data type. Using columnstore indexes, vectors, which are essentially lists of numbers, can be efficiently stored with one row per vector element: each row holds the vector’s ID, the element’s position, and the element’s value. This approach ensures efficient storage and retrieval, making Azure SQL suitable for large-scale vector data management.

I used the Global News Dataset from Kaggle in my experiment.

First, you must create the columns to save the vector information. In my case, I created two columns: title_vector for the news title and content_vector for the news content. For this, I wrote a small Python script, but you can also do it directly from SQL using a cursor. It's important to know that you don't need to pay for any vector database: the vector information is saved inside Azure SQL.

Python
import openai
import pyodbc  # or another SQL connection library

# Set up OpenAI credentials
openai.api_type = "azure"
openai.api_key = "<YOUR AZURE OPEN AI KEY>"
openai.api_base = "https://<your-app-name>.openai.azure.com/"
openai.api_version = "2023-07-01-preview"

# Connect to your Azure SQL database
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=<SERVER>;DATABASE=<DATABASE>;'
                      'UID=<USER>;PWD=<PASSWORD>')

def get_embeddings(text):
    # Truncate the text to 8000 characters
    truncated_text = text[:8000]

    response = openai.Embedding.create(input=truncated_text, engine="embeddings")
    embeddings = response['data'][0]['embedding']
    return embeddings

def update_database(article_id, title_vector, content_vector):
    cursor = conn.cursor()

    # Convert vectors to strings
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # Update the SQL query to use the string representations
    cursor.execute("""
        UPDATE news
        SET title_vector = ?, content_vector = ?
        WHERE article_id = ?
    """, (title_vector_str, content_vector_str, article_id))
    conn.commit()

def embed_and_update():
    cursor = conn.cursor()
    cursor.execute("SELECT article_id, title, full_content FROM news where title_vector is null and full_content is not null and title is not null order by published desc")
    
    title_vector = ""
    content_vector = ""
    
    for row in cursor.fetchall():
        article_id, title, full_content = row
        
        print(f"Embedding article {article_id} - {title}")
        
        title_vector = get_embeddings(title)
        content_vector = get_embeddings(full_content)
        
        update_database(article_id, title_vector, content_vector)

embed_and_update()

These two columns will contain something like this: [-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265…]

Second, you must extract these vectors into a new table, one row per vector element, for each vector column. In this case, I created a stored procedure to do it.

SQL
-- =============================================
-- Author:      Lawrence Teixeira
-- Create Date: 11-24-2023
-- Description: This stored procedure creates vectors for news titles and contents.
-- It processes data from the 'news' table and stores the vectors in separate tables.
-- =============================================

CREATE PROCEDURE [dbo].[Create_News_Vector]
AS
BEGIN
    SET NOCOUNT ON; -- Prevents the sending of DONE_IN_PROC messages to the client.

    -- Extract and store title vectors:
    -- First, check if the 'news_titles_vector' table exists and drop it if it does.
    IF OBJECT_ID('dbo.news_titles_vector', 'U') IS NOT NULL
        DROP TABLE dbo.news_titles_vector;

    -- Using a Common Table Expression (CTE) to process title vectors.
    WITH cte AS
    (
        SELECT 
            v.article_id,    
            CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
            CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
        FROM 
            dbo.news AS v 
        CROSS APPLY 
            OPENJSON(v.title_vector) tv -- Parses JSON of title_vector in the 'news' table.
    )
    -- Create 'news_titles_vector' table with processed vectors.
    SELECT
        article_id,
        vector_value_id,
        vector_value
    INTO
        dbo.news_titles_vector
    FROM
        cte;

    -- Extract and store content vectors:
    -- Check and drop 'news_contents_vector' table if it exists.
    IF OBJECT_ID('dbo.news_contents_vector', 'U') IS NOT NULL
        DROP TABLE dbo.news_contents_vector;

    -- CTE for processing content vectors.
    WITH cte AS
    (
        SELECT 
            v.article_id,    
            CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
            CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
        FROM 
            dbo.news AS v 
        CROSS APPLY 
            OPENJSON(v.content_vector) tv -- Parses JSON of content_vector in the 'news' table.
    )
    -- Create 'news_contents_vector' table with processed vectors.
    SELECT
        article_id,
        vector_value_id,
        vector_value
    INTO
        dbo.news_contents_vector
    FROM
        cte;

    -- Create clustered columnstore indexes on the vector tables.
    -- These indexes optimize data storage and query performance for vector operations.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_news_titles_vector
    ON dbo.news_titles_vector
    ORDER (article_id);

    CREATE CLUSTERED COLUMNSTORE INDEX cci_news_contents_vector
    ON dbo.news_contents_vector
    ORDER (article_id);
END

I also created a function to search the dataset directly using the cosine similarity formula. Note that the function only computes the SUM of the element-wise products: because OpenAI's text-embedding-ada-002 embeddings are normalized to unit length, this dot product equals the cosine similarity.

SQL
-- This Azure SQL function finds news articles similar to the given content vector.
CREATE function [dbo].[SimilarNewsContentArticles](
    @vector nvarchar(max) -- Input parameter: JSON string representing a content vector.
)
returns table -- The function returns a table.
as
return with 

-- CTE for processing the input vector.
cteVector as
(
    select 
        cast([key] as int) as [vector_value_id], -- Extracts and casts the 'key' from JSON to int.
        cast([value] as float) as [vector_value] -- Extracts and casts the 'value' from JSON to float.
    from 
        openjson(@vector) -- Parses the input JSON vector.
),

-- CTE for calculating similarity scores with existing articles.
cteSimilar as
(
    select top (50)
        v2.article_id, 
        sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance 
        -- For unit-length embeddings, this dot product equals cosine similarity.
    from 
        cteVector v1 -- Uses the processed input vector.
    inner join 
        dbo.news_contents_vector v2 
        on v1.vector_value_id = v2.vector_value_id -- Joins with stored article vectors.
    group by
        v2.article_id
    order by
        cosine_distance desc -- Orders by similarity score, descending.
)

-- Final selection combining article data with similarity scores.
select 
    a.*, -- Selects all columns from the news article.
    r.cosine_distance -- Includes the calculated similarity score.
from 
    cteSimilar r -- Uses the similarity scores CTE.
inner join 
    dbo.news a on r.article_id = a.article_id -- Joins with the news articles table.
GO

Finally, you can start querying your table using prompts instead of keywords. This is awesome!
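
To make this concrete, here is a minimal Python sketch that ties the earlier pieces together: it embeds a prompt with the same model and passes the resulting JSON array to the SimilarNewsContentArticles function. The prompt text and connection placeholders are just examples.

Python
import openai
import pyodbc

# Same OpenAI and SQL settings as in the embedding script above.
openai.api_type = "azure"
openai.api_key = "<YOUR AZURE OPEN AI KEY>"
openai.api_base = "https://<your-app-name>.openai.azure.com/"
openai.api_version = "2023-07-01-preview"

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=<SERVER>;DATABASE=<DATABASE>;'
                      'UID=<USER>;PWD=<PASSWORD>')

prompt = "news about electric vehicles in Brazil"

# Embed the prompt with the same model used for the articles.
response = openai.Embedding.create(input=prompt, engine="embeddings")
prompt_vector = str(response['data'][0]['embedding'])  # "[0.01, -0.02, ...]"

# The function parses the JSON array and returns the most similar articles.
cursor = conn.cursor()
cursor.execute(
    "SELECT TOP (10) title, cosine_distance "
    "FROM dbo.SimilarNewsContentArticles(?) "
    "ORDER BY cosine_distance DESC", (prompt_vector,))
for title, score in cursor.fetchall():
    print(f"{score:.4f}  {title}")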

Check out the app I developed, which is designed to assist you in crafting prompts and evaluating your performance using my news dataset. To explore the app, click here.

I also created this GitHub repository with everything I did.

Conclusion

While the sample in this blog is not optimized for maximum efficiency, it is an excellent starting point for understanding and implementing vector operations in Azure SQL. The process, despite its simplicity, is relatively fast. For instance, querying an eight-vCore Azure SQL database can return the fifty most similar articles in just half a second, demonstrating the efficiency and utility of vector operations in Azure SQL Database for data analytics and insights. If you want to know more about this topic, don’t hesitate to get in touch with me.

That’s it for today!

Sources:

Vector Similarity Search with Azure SQL database and OpenAI – Azure SQL Devs’ Corner (microsoft.com)

Vector Similarity Search with Azure SQL database and OpenAI | by Davide Mauri | Microsoft Azure | Medium

Azure-Samples/azure-sql-db-openai: Samples on how to use Azure SQL database with Azure OpenAI (github.com)

Data Wrangler in Microsoft Fabric: A New Tool for Accelerating Data Preparation. Experience the Power Query Feel but with Python Code Output

In the modern digital era, the importance of streamlined data preparation cannot be emphasized enough. For data scientists and analysts, a large portion of time is dedicated to data cleansing and preparation, often termed ‘wrangling.’ Microsoft’s introduction of Data Wrangler in its Fabric suite seems like an answer to this age-old challenge. It promises the intuitiveness of Power Query with the flexibility of Python code output. Dive in to uncover the magic of this new tool.

Data preparation is a time-consuming and error-prone task. It often involves cleaning, transforming, and merging data from multiple sources. This can be a daunting task, even for experienced data scientists.

What is Data Wrangler?

Data Wrangler is a state-of-the-art tool in Microsoft’s Fabric suite, explicitly designed for data professionals. At its core, it aims to simplify the data preparation process by automating tedious tasks. Much like Power Query, it offers a user-friendly interface, but what sets it apart is that it generates Python code as an output. As users interact with the GUI, Python code snippets are generated behind the scenes, making it easier to integrate with various data science workflows.

Advantages of Data Wrangler

  1. User-Friendly Interface: Offers an intuitive GUI for those not comfortable with coding.
  2. Python Code Output: Generates Python code in real-time, allowing flexibility and easy integration.
  3. Time-Saving: Reduces the time spent on data preparation dramatically.
  4. Replicability: Since Python code is generated, it ensures replicable data processing steps.
  5. Integration with Fabric Suite: Can be effortlessly integrated with other tools within the Microsoft Fabric suite.
  6. No-code to Low-code Transition: Ideal for those wanting to transition from a no-code environment to a more code-centric one.

How to use Data Wrangler?

First, click on Data Science inside the Power BI Service.

Next, select the Notebook button.

After uploading the CSV file to the Lakehouse, insert the code below into the notebook.

Python
import pandas as pd

# Read a CSV into a Pandas DataFrame from e.g. a public blob store
df = pd.read_csv("/lakehouse/default/Files/Top_1000_Companies_Dataset.csv")

Click on Launch Data Wrangler and then select the data frame “df”.

On this screen, you can apply all the transformations you need.

In the end, code like this will be generated.

Python
# Code generated by Data Wrangler for pandas DataFrame

def clean_data(df):
    # Drop columns: 'company_name', 'url' and 6 other columns
    df = df.drop(columns=['company_name', 'url', 'city', 'state', 'country', 'employees', 'linkedin_url', 'founded'])
    # Drop columns: 'GrowjoRanking', 'Previous Ranking' and 10 other columns
    df = df.drop(columns=['GrowjoRanking', 'Previous Ranking', 'job_openings', 'keywords', 'LeadInvestors', 'Accelerator', 'valuation', 'btype', 'total_funding', 'product_url', 'growth_percentage', 'contact_info'])
    # Drop column: 'indeed_url'
    df = df.drop(columns=['indeed_url'])
    # Performed 1 aggregation grouped on column: 'Industry'
    df = df.groupby(['Industry']).agg(estimated_revenues_sum=('estimated_revenues', 'sum')).reset_index()
    # Sort by column: 'estimated_revenues_sum' (descending)
    df = df.sort_values(['estimated_revenues_sum'], ascending=[False])
    return df

df_clean = clean_data(df.copy())
df_clean.head()

After that, you can add this transformation to a pipeline, or schedule it to execute automatically.

Data Wrangler Extension for Visual Studio Code

Data Wrangler is a code-centric data cleaning tool integrated into VS Code and Jupyter Notebooks. Data Wrangler aims to increase the productivity of data scientists doing data cleaning by providing a rich user interface that automatically generates Pandas code and shows insightful column statistics and visualizations.

This document will cover how to:

  • Install and setup Data Wrangler
  • Launch Data Wrangler from a notebook
  • Use Data Wrangler to explore your data
  • Perform operations on your data
  • Edit and export code for data wrangling to a notebook
  • Troubleshooting and providing feedback

Setting up your environment

  1. If you have not already done so, install Python.
    IMPORTANT: Data Wrangler only supports Python version 3.8 or higher.
  2. Install Visual Studio Code.
  3. Install the Data Wrangler extension for VS Code from the Visual Studio Marketplace. For additional details on installing extensions, see Extension Marketplace. The extension is named Data Wrangler and is published by Microsoft.

When you launch Data Wrangler for the first time, it will ask you which Python kernel you would like to connect to. It will also check your machine and environment to see if any required Python packages are installed (e.g., Pandas).

Here is a list of the required versions for Python and Python packages, along with whether they are automatically installed by Data Wrangler:

Name   | Minimum required version | Automatically installed
Python | 3.8                      | No
pandas | 0.25.2                   | Yes
regex* | 2020.11.13               | Yes

* We use the open-source regex package to be able to use Unicode properties (for example, /\p{Lowercase_Letter}/), which aren’t supported by Python’s built-in regex module (re). Unicode properties make it easier and cleaner to support foreign characters in regular expressions.
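
A quick illustration of the difference (the sample string is arbitrary):

Python
import regex

# \p{Lowercase_Letter} matches lowercase letters in any script, including
# accented characters; Python's built-in 're' module rejects this pattern.
text = "Águas de março, waters of March"
print(regex.findall(r"\p{Lowercase_Letter}+", text))
# ['guas', 'de', 'março', 'waters', 'of', 'arch']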

If they are not found in your environment, Data Wrangler will attempt to install them for you via pip. If Data Wrangler cannot install dependencies, the easiest workaround is to run the pip install manually and then relaunch Data Wrangler. These dependencies are required so that Data Wrangler can generate Python and Pandas code.

Connecting to a Python kernel

There are currently two ways to connect to a Python kernel, described below.

1. Connect using a local Python interpreter

If this option is selected, the kernel connection is created using the Jupyter and Python extensions. We recommend this option for a simple setup and a quick way to start with Data Wrangler.

2. Connect using Jupyter URL and token

A kernel connection is created using JupyterLab APIs if this option is selected. Note that this option has performance benefits since it bypasses some initialization and kernel discovery processes. However, it will also require separate Jupyter Notebook server user management. We recommend this option generally in two cases: 1) if there are blocking issues in the first method and 2) for power users who would like to reduce the cold-start time of Data Wrangler.

To set up a Jupyter Notebook server and use it with this option, follow the steps below:

  1. Install Jupyter. We recommend installing Anaconda, which comes with Jupyter included. Alternatively, follow the official instructions to install it.
  2. In the appropriate environment (e.g., in an Anaconda prompt if Anaconda is used), launch the server with the following command (replace the jupyter token with your secure token):
    jupyter notebook --no-browser --NotebookApp.token='<your-jupyter-token>'
  3. In Data Wrangler, connect using the address of the spawned server. E.g., http://localhost:8888, and pass in the token used in the previous step. Once configured, this information is cached locally and can automatically be reused for future connections.

Launching Data Wrangler

Once Data Wrangler has been successfully installed, there are two ways to launch it in VS Code.

Launching Data Wrangler from a Jupyter Notebook

If you are in a Jupyter Notebook working with Pandas data frames, you’ll now see a “Launch Data Wrangler” button appear after running specific operations on your data frame, such as df.head(). Clicking the button will open a new tab in VS Code with the Data Wrangler interface in a sandboxed environment.

Important note:
We currently only accept the following formats for launching:

  • df
  • df.head()
  • df.tail()

Where df is the name of the data frame variable. The code above should appear at the end of a cell without any comments or other code after it.
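
For example, a cell like the following would surface the button once it runs (the file name is hypothetical):

Python
import pandas as pd

# Load a data frame; the file name here is hypothetical.
df = pd.read_csv("titanic.csv")

# The bare df.head() is the last statement in the cell, with nothing after it,
# so the "Launch Data Wrangler" button appears after the cell runs.
df.head()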

Launching Data Wrangler directly from a CSV file

You can also launch Data Wrangler directly from a local CSV file. To do so, open any VS Code folder with the CSV dataset you’d like to explore. In the File Explorer panel, right-click the CSV dataset and click “Open in Data Wrangler.”

Using Data Wrangler

The Data Wrangler interface is divided into six components, described below.

The Quick Insights header lets you quickly see valuable information about each column. Depending on the column’s datatype, Quick Insights will show the distribution of the data, the frequency of data points, and missing and unique values.

The Data Grid gives you a scrollable pane to view your entire dataset. Additionally, when selecting an operation to perform, a preview will be illustrated in the data grid, highlighting the modified columns.

The Operations Panel is where you can search through Data Wrangler’s built-in data operations. The operations are organized by their top-level category.

The Summary Panel shows detailed summary statistics for your dataset or a specific column if one is selected. Depending on the data type, it will show information such as min, max values, datatype of the column, skew, and more.

The Operation History Panel shows a human-readable list of all the operations previously applied in the current Data Wrangling session. It enables users to undo specific operations or edit the most recent operation. Selecting a step will highlight the data grid changes and show the generated code associated with that operation.

The Code Preview section will show the Python and Pandas code that Data Wrangler has generated when an operation is selected. It will remain blank when no operation is selected. The code can even be edited by the user, and the data grid will highlight the effect on the data.

Example: Filtering a column

Let’s go through a simple example using Data Wrangler with the Titanic dataset to filter adult passengers on the ship.

We’ll start by looking at the quick insights of the Age column, and we’ll notice the distribution of the ages and that the minimum age is 0.42. For more information, we can glance at the Summary panel to see that the datatype is a float, along with additional statistics such as the passengers’ mean and median age.

To filter for only adult passengers, we can go to the Operations Panel and search for the keyword “Filter” to find the Filter operation. (You can also expand the “Sort and filter” category to find it.)

Once we select an operation, we are brought into the Operation Preview state, where parameters can be modified to see how they affect the underlying dataset before applying the operation. In this example, we want to filter the dataset only to include adults, so we’ll want to filter the Age column to only include values greater than or equal to 18.

Once the parameters are entered in the operation panel, we can see a preview of what will happen to the data. We’ll notice that the minimum value in age is now 18 in the Quick Insights, along with a visual preview of the rows that are being removed, highlighted in red. Finally, we’ll also notice the Code Preview section automatically shows the code that Data Wrangler produced to execute this Filter operation. We can edit this code by changing the filtered age to 21, and the data grid will automatically update accordingly.

After confirming that the operation has the intended effect, we can click Apply.
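
For reference, the code Data Wrangler generates for this filter looks roughly like the following. This is a sketch in the same style as the earlier clean_data example, not the literal output.

Python
# Code generated by Data Wrangler for pandas DataFrame (approximate)

def clean_data(df):
    # Filter rows based on column: 'Age' (keep adult passengers only)
    df = df[df['Age'] >= 18]
    return df

df_clean = clean_data(df.copy())
df_clean.head()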

Editing and exporting code

Each step of the generated code can be modified. Changes to the data will be highlighted in the grid view as you make changes.

Once you’re done with your data cleaning steps in Data Wrangler, there are three ways to export your cleaned dataset from Data Wrangler.

  1. Export code back to Notebook and exit: This creates a new cell in your Jupyter Notebook with all the data cleaning code you generated packaged into a clean Python function.
  2. Export data as CSV: This saves the cleaned dataset as a new CSV file onto your machine.
  3. Copy code to clipboard: This copies all the code generated by Data Wrangler for the data cleaning operations.

Note: If you launched Data Wrangler directly from a CSV, the first export option will be to export the code into a new Jupyter Notebook.

Data Wrangler operations

These are the operations currently supported in the initial launch of Data Wrangler (with many more to be added soon).

Operation | Description
Sort values | Sort column(s) ascending or descending
Filter | Filter rows based on one or more conditions
Calculate text length | Create a new column with values equal to the length of each string value in a text column
One-hot encode | Split categorical data into a new column for each category
Multi-label binarizer | Split categorical data into a new column for each category using a delimiter
Create column from formula | Create a column using a custom Python formula
Change column type | Change the data type of a column
Drop column | Delete one or more columns
Select column | Choose one or more columns to keep and delete the rest
Rename column | Rename one or more columns
Drop missing values | Remove rows with missing values
Drop duplicate rows | Drop all rows that have duplicate values in one or more columns
Fill missing values | Replace cells with missing values with a new value
Find and replace | Find and replace values in a column
Group by column and aggregate | Group by columns and aggregate results
Strip whitespace | Remove whitespace from the beginning and end of the text
Split text | Split a column into several columns based on a user-defined delimiter
Convert text to capital case | Capitalize the first character of a string, with the option to apply to all words
Convert text to lowercase | Convert text to lowercase
Convert text to uppercase | Convert text to UPPERCASE
String transform by example | Automatically perform string transformations when a pattern is detected from the examples you provide
DateTime formatting by example | Automatically perform DateTime formatting when a pattern is detected from the examples you provide
New column by example | Automatically create a column when a pattern is detected from the examples you provide
Scale min/max values | Scale a numerical column between a minimum and maximum value
Custom operation | Automatically create a new column based on examples and the derivation of existing column(s)

Limitations

Data Wrangler currently supports only Pandas DataFrames. Support for Spark DataFrames is in progress.
Data Wrangler’s display works better on large monitors, although different interface portions can be minimized or hidden to accommodate smaller screens.

Conclusion

Data Wrangler in Microsoft Fabric is undeniably a game-changer in data preparation. It combines the best of both worlds by offering the simplicity of Power Query with the robustness and flexibility of Python. As data continues to grow in importance, tools like Data Wrangler that simplify and expedite the data preparation process will be indispensable for organizations aiming to stay ahead.

That’s it for today!

Sources:

https://medium.com/towards-data-engineering/data-wrangler-in-fabric-simplifying-data-prep-with-no-code-ab4fe7429b49

https://radacad.com/fabric-data-wrangler-a-tool-for-data-scientist

https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler

https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler

https://github.com/microsoft/vscode-data-wrangler

OpenAI has unveiled a groundbreaking new feature, the Code Interpreter, accessible to all ChatGPT Plus users. Check out my experiments using the 2739 edition of BRPTO’s Patent Gazette

Code Interpreter is an innovative extension of ChatGPT, now available to all subscribers of the ChatGPT Plus service. This tool boasts the ability to execute code, work with uploaded files, analyze data, create charts, edit files, and carry out mathematical computations. The implications of this are profound, not just for academics and coders, but for anyone looking to streamline their research processes. Code Interpreter transcends the traditional scope of AI assistants, which have primarily been limited to generating text responses. It leverages large language models, the AI technology underpinning ChatGPT, to provide a general-purpose toolbox for problem-solving.

What is the Code Interpreter?

The Code Interpreter Plugin for ChatGPT is a multifaceted addition that provides the AI chatbot with the capacity to handle data and perform a broad range of tasks. This plugin equips ChatGPT with the ability to generate and implement code in natural language, thereby streamlining data evaluation, file conversions, and more. Pioneering users have experienced its effectiveness in activities like generating GIFs and examining musical preferences. The potential of the Code Interpreter Plugin is enormous, having the capability to revolutionize coding processes and unearth novel uses. By capitalizing on ChatGPT’s capabilities, users can harness the power of this plugin, sparking a voyage of discovery and creativity.

Professor Ethan Mollick from the Wharton School of the University of Pennsylvania shares his experiences with using the Code Interpreter

Artificial intelligence is rapidly revolutionizing every aspect of our lives, particularly in the world of data analytics and computational tasks. This transition was recently illuminated by Wharton Professor Ethan Mollick, who commented, “Things that took me weeks to master in my PhD were completed in seconds by the AI.” This is not just a statement about time saved or operational efficiency; it speaks volumes about the growing capabilities of AI technologies, specifically OpenAI’s new tool for ChatGPT – Code Interpreter.

Mollick, an early adopter of AI and an esteemed academic at the Wharton School of the University of Pennsylvania, lauded Code Interpreter as the most significant application of AI in the sphere of complex knowledge work. Not only does it complete intricate tasks in record time, but Mollick also noticed fewer errors than those typically expected from human analysts.

Mollick commended Code Interpreter’s use of Python, a versatile programming language known for its application in software building and data analysis. He pointed out that it closes some of the gaps in language models as the output is not entirely text-based. The code is processed through Python, which promptly flags any errors.

In practice, when given a dataset on superheroes, Code Interpreter could clean and merge the data seamlessly, with an admirable effort to maintain accuracy. This process would have been an arduous task otherwise. Additionally, it allows a back-and-forth interaction during data visualization, accommodating various alterations and enhancements.

Remarkably, Code Interpreter doesn’t just perform pre-set analyses but recommends pertinent analytical approaches. For instance, it conducted predictive modeling to anticipate a hero’s potential powers based on other factors. Mollick was struck by the AI’s human-like reasoning about data, noting the AI’s observation that the powers were often visually noticeable as they derived from the comic book medium.

Beyond its technical capabilities, Code Interpreter democratizes access to complex data analysis, making it accessible to more people, thereby transforming the future of work. It saves time and reduces the tedium of repetitive tasks, enabling individuals to focus on more fulfilling, in-depth work.

Here are 10 examples of how you can use Code Interpreter for data analysis:

  1. Analyzing customer feedback data to identify trends and patterns.
  2. Creating interactive dashboards and reports for business intelligence purposes.
  3. Cleaning and transforming datasets for machine learning models.
  4. Extracting insights from social media data to inform marketing strategies.
  5. Generating charts and graphs to visualize sales data.
  6. Analyzing website traffic data to optimize the user experience.
  7. Creating custom functions and scripts for specific data analysis tasks.
  8. Performing statistical analysis on survey data.
  9. Automating repetitive data analysis tasks with Python scripts.
  10. Creating custom visualizations for presentations and reports.

How to use it? Follow my own experiments

Initially, you upload a file of your choice. You can then ask questions about it. The Code Interpreter will generate and run a Python script to address nearly any inquiry you have. The entire process is incredibly swift. I conducted a few trials using the XML file from BRPTO’s 2739 Patent Gazette. The experience was truly incredible. There’s no need for any understanding of coding or Python. The Code Interpreter simply provides the results. If you wish, you can view the Python script.

If you select the “Show Work” option, the Python script becomes visible.

It executed the work flawlessly!

This is the screen of the CSV file.

Here is the Python code generated by the Code Interpreter to create the CSV file:

Python
import pandas as pd
import xml.etree.ElementTree as ET

# Load the XML file
tree = ET.parse("/mnt/data/Patente_2739_04072023.xml")
root = tree.getroot()

# Function to extract data from XML into a dictionary
def extract_data(root):
    data = []
    for despacho in root.findall('despacho'):
        row = {}
        row['despacho_codigo'] = despacho.find('codigo').text
        row['despacho_titulo'] = despacho.find('titulo').text
        row['despacho_comentario'] = despacho.find('comentario').text if despacho.find('comentario') is not None else ""
        processo_patente = despacho.find('processo-patente')
        row['processo_numero'] = processo_patente.find('numero').text
        row['processo_data_deposito'] = processo_patente.find('data-deposito').text if processo_patente.find('data-deposito') is not None else ""
        row['processo_concessao'] = processo_patente.find('concessao').text if processo_patente.find('concessao') is not None else ""
        titulars = processo_patente.findall('titular-lista/titular')
        row['titular_nome_completo'] = "; ".join([titular.find('nome-completo').text for titular in titulars])
        row['titular_endereco_pais'] = "; ".join([titular.find('endereco/pais/sigla').text for titular in titulars])
        data.append(row)
    return data

# Extract data from XML
data = extract_data(root)

# Convert to DataFrame
df = pd.DataFrame(data)

# Save DataFrame to CSV
df.to_csv('/mnt/data/patente.csv', index=False)

df.head()  # Display the first few rows of the DataFrame

# Extract distinct application/patent numbers
distinct_application_numbers = df['processo_numero'].unique()

# Convert to DataFrame
df_distinct_application_numbers = pd.DataFrame(distinct_application_numbers, columns=['processo_numero'])

# Save DataFrame to CSV
df_distinct_application_numbers.to_csv('/mnt/data/distinct_application_numbers.csv', index=False)

df_distinct_application_numbers.head()  # Display the first few rows of the DataFrame

This video demonstrates the capabilities of the Code Interpreter.

You can find more information on the official OpenAI site by clicking here.

Conclusion

Code Interpreter is a powerful tool that is making data analysis accessible for everyone with ChatGPT Plus. By allowing users to run code snippets within their chat sessions, it enables them to perform a wide range of data analysis tasks quickly and easily. Whether you’re analyzing customer feedback data or creating custom visualizations for presentations and reports, Code Interpreter has something to offer everyone.

Code Interpreter invites us to consider how we can leverage such advancements across various sectors impacted by AI. Indeed, Code Interpreter signifies the dawn of a new era in artificial intelligence and computational capabilities. So why not give it a try today?

That’s it for today!

Sources:

Wharton professor sees future of work in new ChatGPT tool | Fortune

https://openai.com/blog/chatgpt-plugins#code-interpreter

https://www.searchenginejournal.com/code-interpreter-chatgpt-plus/490980/#close

https://www.gov.br/inpi/pt-br

The Future of Data Analytics: An Introduction to Microsoft Fabric

Microsoft Fabric, launched at the Microsoft Build event in May 2023, is an end-to-end data and analytics platform that combines Microsoft’s OneLake data lake, Power BI, Azure Synapse, and Azure Data Factory into a unified software as a service (SaaS) platform. It’s a one-stop solution designed to serve various data professionals, including data engineers, data warehousing professionals, data scientists, data analysts, and business users, enabling them to collaborate effectively within the platform to foster a healthy data culture across their organizations.

What are the Microsoft Fabric key features?


Data Factory – Data Factory in Microsoft Fabric combines the simplicity of Power Query with the scale of Azure Data Factory. It provides over 200 native connectors for data linkage from on-premises and cloud-based sources, and enables the scheduling and orchestration of notebooks and Spark jobs.

Data Engineering – Leveraging the extensive capabilities of Spark, data engineering in Microsoft Fabric provides premier authoring experiences and facilitates large-scale data transformations. It plays a crucial role in democratizing data through the lakehouse model. Moreover, integration with Data Factory allows notebooks and Spark jobs to be scheduled and orchestrated.

Data Science – The data science capability in Microsoft Fabric aids in building, deploying, and operationalizing machine learning models within the Fabric framework. It interacts with Azure Machine Learning for built-in experiment tracking and model registry, empowering data scientists to enhance organizational data with predictions that business analysts can incorporate into their BI reports, thereby transitioning from descriptive to predictive insights.

Data Warehouse – The data warehousing component of Microsoft Fabric offers top-tier SQL performance and scalability. It features a full separation of computing and storage for independent scaling and native data storage in the open Delta Lake format.

Real-Time Analytics – Observational data, acquired from diverse sources like apps, IoT devices, human interactions, and more, represents the fastest-growing data category. This semi-structured, high-volume data, often in JSON or Text format with varying schemas, presents challenges for conventional data warehousing platforms. However, Microsoft Fabric’s Real-Time Analytics offers a superior solution for analyzing such data.

Power BI – Recognized as a leading Business Intelligence platform worldwide, Power BI in Microsoft Fabric enables business owners to access all Fabric data swiftly and intuitively for data-driven decision-making.

What are the Advantages of Microsoft Fabric?

Unified Platform: Microsoft Fabric provides a unified platform for different data analytics workloads such as data integration, engineering, warehousing, data science, real-time analytics, and business intelligence. This can foster a well-functioning data culture across the organization as data engineers, warehousing professionals, data scientists, data analysts, and business users can collaborate within Fabric​​.

Multi-cloud Support: Fabric is designed with a multi-cloud approach in mind, with support for data in Amazon S3 and (soon) Google Cloud Platform. This means that users are not restricted to using data only from Microsoft’s ecosystem, providing flexibility​.

Accessibility: Microsoft Fabric is currently available in public preview, and anyone can try the service without providing their credit card information. Starting from July 1, Fabric will be enabled for all Power BI tenants​.

AI Integration: The private preview of Copilot in Power BI will combine advanced generative AI with data, enabling users to simply describe the insights they need or ask a question about their data, and Copilot will analyze and pull the correct data into a report, turning data into actionable insights instantly​​.

Microsoft Fabric – Licensing and Pricing

Microsoft Fabric capacities are available for purchase in the Azure portal. These capacities provide the compute resources for all the experiences in Fabric, from ingestion and transformation with Data Factory, through Data Engineering, Data Science, Data Warehouse, and Real-Time Analytics, all the way to data visualization with Power BI. A single capacity can power all workloads concurrently and does not need to be pre-allocated across the workloads. Moreover, a single capacity can be shared among multiple users and projects, without any limitations on the number of workspaces or creators that can utilize it.

To gain access to Microsoft Fabric, you have three options:

  1. Leverage your existing Power BI Premium subscription by turning on the Fabric preview switch. All Power BI Premium capacities can instantly power all the Fabric workloads with no additional action required: your existing Power BI Premium resources can run the full range of data and analytics tasks that Microsoft Fabric handles.
  2. Start a Fabric trial if your tenant supports trials. If you’re not sure about committing to Microsoft Fabric yet, you can start a trial if your tenant (an instance of Azure Active Directory) supports it. A trial allows you to test the service before deciding to purchase. During the trial period, you can explore the full capabilities of Microsoft Fabric, such as data ingestion, data transformation, data engineering, data science, data warehouse operations, real-time analytics, and data visualization with Power BI.
  3. Purchase a Fabric pay-as-you-go capacity from the Azure portal. If you decide that Microsoft Fabric suits your needs and you don’t have a Power BI Premium subscription, you can directly purchase a Fabric capacity on a pay-as-you-go basis from the Azure portal. The pay-as-you-go model is flexible because it allows you to pay for only the compute and storage resources you use. Microsoft Fabric capacities come in different sizes, from F2 to F2048, representing 2 – 2048 Capacity Units (CU). Your bill will be determined by the amount of computing you provision (i.e., the size of the capacity you choose) and the amount of storage you use in OneLake, the data lake built into Microsoft Fabric. This model also allows you to easily scale your capacities up and down to adjust their computing power, and even pause your capacities when not in use to save on your bills​​.

Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use.

Check out this video from Guy in a Cube, which breaks down the details on pricing and licensing.

How to activate the Microsoft Fabric Trial version?

Step 1

Log in to Microsoft Power BI with your developer account.

You will observe that, aside from the OneLake icon at the top left, everything looks normal if you are familiar with the Power BI Service.

Step 2

Enable Microsoft Fabric for your Tenant

So far, we’ve only enabled Microsoft Fabric at the tenant level. This doesn’t yet give full access to Fabric resources.

So, let’s upgrade the Power BI license to the Microsoft Fabric Trial.

For a smoother experience, you should create a new Workspace and assign the Microsoft Fabric Trial license to it.

While creating a new Workspace, you can now assign the Fabric Trial license to it. Upon creation, we are able to take full advantage of Microsoft Fabric.

This video by Guy in a Cube explains the steps for getting the Microsoft Fabric Trial.

Conclusion

Microsoft Fabric is currently in preview but already represents a significant advancement in the field of data and analytics, offering a unified platform that brings together various tools and services. It enables a smooth and collaborative experience for a variety of data professionals, fostering a data-driven culture within organizations. Let’s wait for Microsoft’s next steps.

That’s it for today!