Cost-Effective Text Embedding: Leveraging Ollama Local Models with Azure SQL Databases

Embedding text using a local model can provide significant cost advantages and flexibility over cloud-based services. In this blog post, we explore how to set up and use a local model for text embedding and how this approach can be integrated with Azure SQL databases for advanced querying capabilities.

Cost Comparison: OpenAI's text-embedding-ada-002 Pay-Per-Use Model vs. Local Model Setup Costs

When choosing between a paid service and setting up a local model for text embedding, it’s crucial to consider the cost implications based on the scale of your data and the frequency of usage. Below is a detailed comparison of the costs of using a paid model versus establishing a local one.

Pay Model Cost Estimate:

Open AI text-embedding-ada-002:

Using a paid model like OpenAI's text-embedding-ada-002 (Ada v2) to embed 1 terabyte of OCR text would cost around $25,000. This estimate assumes roughly 4 characters per token, a ratio that varies with the content and structure of the OCR texts.
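This back-of-the-envelope estimate can be reproduced in a few lines. It is only a sketch; the ~4 characters-per-token ratio and the $0.0001 per 1,000 tokens list price for ada-002 at the time of writing are the assumptions:

```python
# Rough cost estimate for embedding 1 TB of OCR text with text-embedding-ada-002.
# Assumptions: ~4 characters per token, $0.0001 per 1K tokens (ada-002 list price).
TOTAL_CHARS = 1_000_000_000_000      # 1 TB of text, 1 byte per character
CHARS_PER_TOKEN = 4
PRICE_PER_1K_TOKENS = 0.0001         # USD

tokens = TOTAL_CHARS / CHARS_PER_TOKEN          # 250 billion tokens
cost = tokens / 1_000 * PRICE_PER_1K_TOKENS     # USD

print(f"Estimated tokens: {tokens:,.0f}")
print(f"Estimated cost:   ${cost:,.0f}")
```

The real number of tokens depends on the tokenizer and the text, so treat the result as an order-of-magnitude figure rather than a quote.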

Local Model Cost Estimate:

Setup Costs:

The initial investment for setting up a local model can range from $4,050 to $12,750, depending on the selection of components, from mid-range to high-end. This one-time cost can be amortized over many uses and datasets, potentially offering a more cost-effective solution in the long run, especially for large data volumes.

Overall Financial Implications

While the upfront cost for a local model might seem high, it becomes significantly more economical with increased data volumes and repeated use. In contrast, the cost of using a pay model like OpenAI’s text-embedding-ada-002 scales linearly with data volume, leading to potentially high ongoing expenses.

Considering these factors, the local model offers a cost advantage and greater control over data processing and security, making it an attractive option for organizations handling large quantities of sensitive data.

Why Did I Decide to Use a Local Model?

Cost and data volume considerations primarily drove the decision to use a local model for text embedding. With over 20 terabytes of data, including 1 terabyte of OCR text to embed, the estimated cost of using a commercial text-embedding model like OpenAI’s text-embedding-ada-002 would be around USD 25,000. By setting up a local model, we can process our data at a fraction of this cost, reducing expenses by 49% to 84%.
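The 49% to 84% range follows directly from the setup-cost estimates above; a quick sketch of the arithmetic:

```python
# Savings range: one-time local setup vs. the ~$25,000 pay-per-use estimate for 1 TB.
PAY_MODEL_COST = 25_000                                   # USD, ada-002 estimate
SETUP_COSTS = {"mid-range": 4_050, "high-end": 12_750}    # USD, one-time hardware cost

savings = {tier: 1 - setup / PAY_MODEL_COST for tier, setup in SETUP_COSTS.items()}
for tier, pct in savings.items():
    print(f"{tier}: {pct:.0%} saved vs. the pay-per-use estimate")
```

Note this ignores electricity, maintenance, and engineering time, so the real savings sit somewhat below these figures.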

Exploring Local Models: Testing BGE-M3, MXBAI-EMBED-LARGE, and NOMIC-EMBED-TEXT Against OpenAI's text-embedding-ada-002

In my recent tests with the local embedding models BGE-M3 and NOMIC-EMBED-TEXT, I encountered some intriguing results: both models showed an accuracy below 0.80 when benchmarked against OpenAI's text-embedding-ada-002. This comparison sparked a valuable discussion about the capabilities and limitations of different embedding technologies.

How to Choose the Best Model for Your Needs?

When evaluating open-source embedding models like NOMIC-EMBED-TEXT, BGE-M3, and MXBAI-EMBED-LARGE, consider the specific strengths that make each one suitable for different machine learning tasks.

1. NOMIC-EMBED-TEXT: This model is specifically designed for handling long-context text, making it suitable for tasks that involve processing extensive documents or content that benefits from understanding broader contexts. It achieves this by training on full Wikipedia articles and various large-scale question-answering datasets, which helps it capture long-range dependencies.

2. BGE-M3: Part of the BGE (BAAI General Embedding) series from the Beijing Academy of Artificial Intelligence, this model is adapted for sentence similarity tasks. It's built to handle multilingual content effectively, which makes it a versatile choice for applications that require understanding or comparing sentences across different languages.

3. MXBAI-EMBED-LARGE: This model is noted for its feature extraction capabilities, making it particularly useful for tasks that require distilling complex data into simpler, meaningful representations. Its training involves diverse datasets, enhancing its generalization across text types and contexts.

Each model brings unique capabilities, such as handling longer texts or providing robust multilingual support. When choosing among these models, consider the specific needs of your project, such as the length of text you need to process, the importance of multilingual capabilities, and the type of machine learning tasks you aim to perform (e.g., text similarity, feature extraction). Testing them with specific data is crucial to determine which model performs best in your context.

In our analysis, we compared the results and identified the best open-source model relative to OpenAI's text-embedding-ada-002.

We executed this query using the keyword ‘Microsoft’ to search the vector table and compare the content of Wikipedia articles.

SQL
declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'
select w.title, w.text from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 1, 0.25) 
order by dot_product desc) w
order by w.title
go

We utilized the KMeans compute node for text similarity analysis, focusing on a single-cluster search. For a detailed, step-by-step guide on creating this dataset, please refer to the article linked in the Sources section at the end of this post.

Here is an overview of the results:

To calculate the percentage of similarity of each model with “Text-embedding-ada-002”, we’ll determine how many keywords match between “Text-embedding-ada-002” and the other models, then express this as a percentage of the total keywords in “Text-embedding-ada-002”. Here’s the updated table with the percentages:

Here is the comparison table:

  1. Text-embedding-ada-002 Keywords Total: 10 (100% is based on these keywords).
  2. Matching Keywords:
       – BGE-M3: Matches 7 out of 10 keywords of Text-embedding-ada-002.
       – NOMIC-EMBED-TEXT: Matches 3 out of 10 keywords of Text-embedding-ada-002.
       – MXBAI-EMBED-LARGE: Matches 1 out of 10 keywords of Text-embedding-ada-002.

This table shows that BGE-M3 is the most similar to "Text-embedding-ada-002," with 70% of keywords matching, followed by NOMIC-EMBED-TEXT at 30% and MXBAI-EMBED-LARGE with the least similarity at 10%.
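The similarity percentages are simple keyword-overlap ratios; a minimal sketch of the calculation, using the match counts listed above:

```python
# Keyword-overlap similarity vs. text-embedding-ada-002's top-10 result set.
ADA_TOTAL = 10  # keywords returned by text-embedding-ada-002 (the 100% baseline)
MATCHES = {"BGE-M3": 7, "NOMIC-EMBED-TEXT": 3, "MXBAI-EMBED-LARGE": 1}

overlap = {model: hits / ADA_TOTAL for model, hits in MATCHES.items()}
for model, pct in overlap.items():
    print(f"{model}: {pct:.0%} keyword overlap")
```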

How does it perform when doing an approximate search with 1, 4, 8, and 16 clusters?

To perform this test, we execute the following query in each Azure database, once per model:

SQL
create table #trab ( linha varchar(200) null )

insert into #trab (linha) values ('Model: mxbai-embed-large')

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 1 cluster')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 1, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 4 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 4, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 8 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 8, 0.25) 
order by dot_product desc) w
order by w.title
go

declare @v nvarchar(max)
select @v = content_vector from dbo.wikipedia_articles_embeddings where title = 'Microsoft'

insert into #trab (linha) values ('')
insert into #trab (linha) values ('Search with 16 clusters')

insert into #trab (linha)
select w.title from 
(select top (10) id, title, text, dot_product
from [$vector].find_similar$wikipedia_articles_embeddings$content_vector(@v, 16, 0.25) 
order by dot_product desc) w
order by w.title
go

select * from #trab

drop table #trab

Here is an overview of the results:

Based on the previous detailed list, here are the calculations for the percentage of similarity:

1. Total Distinct Keywords in Text-embedding-ada-002: 10 (100% based on these keywords)

2. Keywords in each Cluster Search:

   – BGE-M3: 5 keywords (Microsoft, Microsoft Office, Microsoft Windows, Microsoft Word, MSN)

   – NOMIC-EMBED-TEXT: 4 keywords (Microsoft, MSN, Nokia, Outlook.com)

   – MXBAI-EMBED-LARGE: 2 keywords (Microsoft, Nokia)

Here’s the updated table with the percentage similarity for searches with 1, 4, 8, and 16 clusters:

This table shows the similarity percentages for each model across different cluster configurations compared to the “text-embedding-ada-002” model. Each model retains a consistent similarity percentage across all cluster numbers, indicating that the cluster configuration did not affect the keywords searched for in these cases.

Before running the Python code that creates the embeddings, you first have to install Ollama.

How Did You Set Up a Local Model Using Ollama?

To run an Ollama model with your GPU, you can use the official Docker image provided by Ollama. The Docker image supports Nvidia GPUs and can be installed using the NVIDIA Container Toolkit. Here are the steps to get started:

  1. Install Docker: Download and install Docker Desktop or Docker Engine, depending on your operating system.
  2. Pull the Ollama Docker Image: Choose a preferred model from the Ollama library, such as nomic-embed-text or mxbai-embed-large, then pull the Ollama image with the following command: docker pull ollama/ollama.
  3. Run the Ollama Docker Image: Execute Docker run commands to set up the Ollama container. You can configure it specifically for either CPU or Nvidia GPU environments. Run the Docker container with the following command: docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama.
  4. Run the Model: You can now run your chosen model with: docker exec -it ollama ollama run nomic-embed-text or docker exec -it ollama ollama run mxbai-embed-large. Ollama downloads the model on first run.
  5. Access and Use the Model: To start interacting with your model, send requests to the Ollama API at the local address provided (typically http://localhost:11434).

Please note that the above commands assume you have already installed Docker on your system. If you haven’t installed Docker yet, you can download it from the official Docker website.

You can also download and install Ollama on Windows:

How Do You Convert Text into Embeddings Using the Local Model with Ollama?

After setting up your local model with Ollama, you can use the following Python script to convert text into embeddings:

Python
# Importing necessary libraries and modules
import os
import pyodbc  # SQL connection library for Microsoft databases
import requests  # For making HTTP requests
from dotenv import load_dotenv  # To load environment variables from a .env file
import numpy as np  # Library for numerical operations
from sklearn.preprocessing import normalize  # For normalizing data
import json  # For handling JSON data
from db.utils import NpEncoder  # Custom JSON encoder for numpy data types

# Load environment variables from a .env file located in the same directory
load_dotenv()

# The connection string for the Azure SQL database is read from environment variables, e.g.:
#MSSQL='Driver={ODBC Driver 17 for SQL Server};Server=localhost;Database=<DATABASE NAME>;Uid=<USER>;Pwd=<PASSWORD>;Encrypt=No;Connection Timeout=30;'

# Retrieve the database connection string from environment variables
dbconnectstring = os.getenv('MSSQL')

# Establish a connection to the Azure SQL database using the connection string
conn = pyodbc.connect(dbconnectstring)

def get_embedding(text, model):
    # Placeholder for truncation or other preprocessing; the text is passed through unchanged here
    truncated_text = text

    # Make an HTTP POST request to a local server API to get embeddings for the input text
    res = requests.post(url='http://localhost:11434/api/embeddings',
                        json={
                            'model': model, 
                            'prompt': truncated_text
                        }
    )
    
    # Extract the embedding from the JSON response
    embeddings = res.json()['embedding']
    
    # Convert the embedding list to a numpy array
    embeddings = np.array(embeddings)    
    
    # Normalize the embeddings array to unit length
    nc = normalize([embeddings])
        
    # Convert the numpy array back to a JSON string using a custom encoder that handles numpy types
    return json.dumps(nc[0], cls=NpEncoder)

def update_database(id, title_vector, content_vector):
    # Obtain a new cursor from the database connection
    cursor = conn.cursor()

    # The embeddings already arrive as JSON strings from get_embedding;
    # str() is a harmless safeguard before binding them as SQL parameters
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # SQL query to update the embeddings in the database
    cursor.execute("""
        UPDATE wikipedia_articles_embeddings
        SET title_vector = ?, content_vector = ?
        WHERE id = ?
    """, (title_vector_str, content_vector_str, id))
    conn.commit()  # Commit the transaction to the database

def embed_and_update(model):
    # Get a cursor from the database connection
    cursor = conn.cursor()
    
    # Retrieve articles from the database that need their embeddings updated
    cursor.execute("select id, title, text from wikipedia_articles_embeddings where title_vector = '' or content_vector = '' order by id desc")
    
    for row in cursor.fetchall():
        id, title, text = row
        
        # Get embeddings for title and text
        title_vector = get_embedding(title, model)
        content_vector = get_embedding(text, model)
        
        # Print the progress with length of the generated embeddings
        print(f"Embedding article {id} - {title}", "len:", len(title_vector), len(content_vector))
        
        # Update the database with new embeddings
        update_database(id, title_vector, content_vector)

# Call the function to update embeddings using the 'nomic-embed-text' model
embed_and_update('nomic-embed-text')

# To use another model, uncomment and call the function with the different model name
# embed_and_update('mxbai-embed-large')
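A note on the normalize call in the script above: scaling each embedding to unit length makes the dot product of two vectors equal their cosine similarity, which is what the dot_product ranking in the SQL queries relies on. A minimal illustration with plain numpy (for a single vector, dividing by the L2 norm is equivalent to sklearn's normalize):

```python
import numpy as np

v = np.array([3.0, 4.0])

# Dividing by the L2 norm (here, 5.0) is what normalize([v])[0] does for one vector.
unit = v / np.linalg.norm(v)

print(unit)                          # the unit-length version of v
print(float(np.linalg.norm(unit)))  # its norm is now 1.0

# For unit-length vectors, the dot product equals the cosine similarity.
w = np.array([4.0, 3.0]) / 5.0
print(float(unit @ w))
```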

I’ve also created a GitHub repository with this code; you can access it at this link.

Download the pre-calculated embeddings using OpenAI’s text-embedding-ada-002

The pre-calculated embeddings, generated with OpenAI’s text-embedding-ada-002 for both the title and the body of a selection of Wikipedia articles, are made available by OpenAI here:

https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip

Once you have successfully embedded your text, I recommend exploring two of my blog posts that detail how to create a vector database for prompting and searching. These posts provide step-by-step guidance on utilizing Azure SQL alongside cosine similarity and KMeans algorithms for efficient and effective data retrieval.

Azure SQL Database now has native vector support

You can sign up for the private preview at this link.

This article, published by Davide Mauri and Pooja Kamath during this week’s Microsoft Build event, provides all the information.

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

Conclusion

Embedding text locally using models served through Ollama presents a cost-effective, scalable solution for handling large volumes of data. By integrating these embeddings into Azure SQL databases, organizations can leverage generative AI to enhance their querying capabilities, making it easier to extract meaningful insights from vast datasets. The outlined process ensures significant cost savings and enhances data security and processing efficiency.

This approach is not just a technical exercise but a strategic asset that can drive better decision-making and innovation across various data-intensive applications.

That’s it for today!

Sources

GitHub – Azure-Samples/azure-sql-db-vectors-kmeans: Use KMeans clustering to speed up vector search in Azure SQL DB

Vector Similarity Search with Azure SQL database and OpenAI | by Davide Mauri | Microsoft Azure | Medium

Ollama

How to Install and Run Ollama with Docker: A Beginner’s Guide – Collabnix

Leveraging KMeans Compute Node for Text Similarity Analysis through Vector Search in Azure SQL – Tech News & Insights (lawrence.eti.br)

Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide How to Use Generative AI to Prompt Queries in Datasets – Tech News & Insights (lawrence.eti.br)

GitHub – LawrenceTeixeira/embedyourlocalmodel

The New Black Gold: How Data Became the Most Valuable Asset in Tech

In the annals of history, the term “black gold” traditionally referred to oil, a commodity that powered the growth of modern economies, ignited wars, and led to the exploration of uncharted territories. Fast forward to the 21st century, and a new form of black gold has emerged, one that is intangible yet infinitely more powerful: data. This precious commodity has become the cornerstone of technological innovation, driving the evolution of artificial intelligence (AI), shaping economies, and transforming industries. Let’s dive into how data ascended to its status as the most valuable asset in technology.

The Economic Power of Data

Data has transcended its role as a mere resource for business insights and operations, becoming a pivotal economic asset. Companies that possess vast amounts of data or have the capability to efficiently process and analyze data hold significant economic power and influence. This influence is not just limited to the tech industry but extends across all sectors, including healthcare, finance, and manufacturing, to name a few. Leveraging data effectively can lead to groundbreaking innovations, disrupt industries, and create new markets.

Image sourced from this website: Value in the digital economy: data monetised (nationthailand.com)

The economic potential of data is immense. The ability to harness insights from data translates into a competitive advantage for businesses. Predictive analytics, driven by data, enable companies to forecast customer behavior, optimize pricing strategies, and streamline supply chains. Data analysis is critical to personalized medicine, diagnostics, and drug discovery in healthcare. In the financial sector, data-driven algorithms power trading strategies and risk management assessments. Data’s reach extends beyond traditional industries, transforming fields like agriculture through precision farming and intelligent sensors.

The rise of data-driven decision-making has given birth to a thriving data economy. Companies specialize in aggregating, cleansing, and enriching datasets, turning them into marketable assets. The development of machine learning and artificial intelligence tools, combined with big data, enables more sophisticated and transformative data usage. Industries across the spectrum recognize the power of data, fueling investment in technologies and talent, with data scientists and analysts finding themselves in high demand.

The Rise of Data as a Commodity

The rise of data as a commodity represents a significant shift in the global economy, where the value of intangible assets, specifically digital data, has surpassed that of traditional physical commodities. This transition reflects the increasing importance of data in driving innovation, enhancing productivity, and fostering economic growth.

According to International Banker, the value of data has escalated because of the vast volumes available to financial services and other organizations, coupled with the nearly limitless processing power of cloud computing. This has enabled the manipulation, integration, and analysis of diverse data sources, transforming data into a critical asset for the banking sector and beyond. Robotics and Automation News further illustrates this by noting the exponential rise in Internet-connected devices, which has led to the generation of staggering amounts of data daily. As of 2018, more than 22 billion Internet-of-Things (IoT) devices were active, highlighting the vast scale of data generation and its potential value.

MIT Technology Review emphasizes data as a form of capital, akin to financial and human capital, which is essential for creating new digital products and services. This perspective is supported by studies indicating that businesses prioritizing “data-driven decision-making” achieve significantly higher output and productivity. Consequently, companies rich in data assets, such as Airbnb, Facebook, and Netflix, have redefined competition within their industries, underscoring the need for traditional companies to adopt a data-centric mindset.

The transformation of data into a valuable commodity is not just a technological or economic issue but also entails significant implications for privacy, security, and governance. As organizations harness the power of data to drive business and innovation, the ethical considerations surrounding data collection, processing, and use become increasingly paramount.

In summary, the rise of data as a commodity marks a pivotal development in the digital economy, highlighting the critical role of data in shaping future economic landscapes, driving innovation, and redefining traditional industry paradigms.

The Challenges and Ethics of Data Acquisition

The discourse on the challenges and ethics of data acquisition and the application of artificial intelligence (AI) spans various considerations, reflecting the intricate web of moral, societal, and legal issues that modern technology presents. As AI becomes increasingly integrated into various facets of daily life, its potential to transform industries, enhance efficiency, and contribute to societal welfare is matched by significant ethical and societal challenges. These challenges revolve around privacy, discrimination, accountability, transparency, and the overarching role of human judgment in the age of autonomous decision-making systems (OpenMind, Harvard Gazette).

The ethical use of data and AI involves a nuanced approach that encompasses not just the legal compliance aspect but also the moral obligations organizations and developers have towards individuals and society at large. This includes ensuring privacy through anonymization and differential privacy, promoting inclusivity by actively seeking out diverse data sources to mitigate systemic biases, and maintaining transparency about how data is collected, used, and shared. Ethical data collection practices emphasize the importance of the data life cycle, ensuring accountability and accuracy from the point of collection to eventual disposal (Omdena, ADP).

Moreover, the ethical landscape of AI and data use extends to addressing concerns about unemployment and the societal implications of automation. As AI continues to automate tasks traditionally performed by humans, questions about the future of work, socio-economic inequality, and environmental impacts come to the forefront. Ethical considerations also include automating decision-making processes, which can either benefit or harm society based on the ethical standards encoded within AI systems. The potential for AI to exacerbate existing disparities and the risk of moral deskilling among humans as decision-making is increasingly outsourced to machines underscore the need for a comprehensive ethical framework governing AI development and deployment (Markkula Center for Applied Ethics).

In this context, the principles of transparency, fairness, and responsible stewardship of data and AI technologies form the foundation of ethical practice. Organizations are encouraged to be transparent about their data practices, ensure fairness in AI outcomes to avoid amplifying biases, and engage in ethical deliberation to navigate the complex interplay of competing interests and values. Adhering to these principles aims to harness the benefits of AI and data analytics while safeguarding individual rights and promoting societal well-being (ADP).

How is the “new black gold” being utilized?

1. AI-driven facial Emotion Detection
  • Overview: This application uses deep learning algorithms to analyze facial expressions and detect emotions. This technology provides insights into human emotions and behavior and is used in various fields, including security, marketing, and healthcare.
  • Data Utilization: By training on vast datasets of facial images tagged with emotional states, the AI can learn to identify subtle expressions, showcasing the critical role of diverse and extensive data in enhancing algorithm accuracy.
2. Food Freshness Monitoring Systems
  • Overview: A practical application that employs AI to monitor the freshness of food items in your fridge. It utilizes image recognition and machine learning to detect signs of spoilage or expiration.
  • Data Requirement: This system relies on a comprehensive dataset of food items in various states of freshness, learning from visual cues to accurately predict when food might have gone bad, thus reducing waste and ensuring health safety.
3. Conversational AI Revolutionized
  • Overview: Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude are state-of-the-art language models that simulate human-like conversations, providing responses that can be indistinguishable from a human’s. They are used in customer service, marketing, education, and entertainment.
  • Data Foundation: The development of LLMs required extensive training on diverse language data from books, websites, and other textual sources, highlighting the need for large, varied datasets to achieve nuanced understanding and generation of human language.
4. Synthetic Data Generation for AI Training
  • Overview: To address privacy concerns and the scarcity of certain types of training data, some AI projects are turning to synthetic data generation. This involves creating artificial datasets that mimic real-world data, enabling the continued development of AI without compromising privacy.
  • Application of Data: These projects illustrate the innovative use of algorithms to generate new data points, demonstrating how unique data needs push the boundaries of what’s possible in AI research and development.
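As a toy illustration of the idea (not any specific project's method, and assuming numpy is available), synthetic records can be drawn from a distribution fitted to real data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for sensitive real-world measurements that cannot be shared directly.
real = rng.normal(loc=170.0, scale=10.0, size=1_000)

# Fit simple summary statistics to the real data...
mu, sigma = real.mean(), real.std()

# ...then sample brand-new artificial records from the fitted distribution.
synthetic = rng.normal(loc=mu, scale=sigma, size=1_000)

print(round(float(synthetic.mean()), 1))  # statistically similar, yet no real record is reused
```

Production synthetic-data systems model far richer structure (correlations, categorical fields, privacy guarantees), but the principle is the same: learn a distribution, then sample from it.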

What are Crawling Services and Platforms?

Crawling services and platforms are specialized software tools and infrastructure designed to navigate and index the content of websites across the internet systematically. These services work by visiting web pages, reading their content, and following links to other pages within the same or different websites, effectively mapping the web structure. The data collected through this process can include text, images, and other multimedia content, which is then used for various purposes, such as web indexing for search engines, data collection for market research, content aggregation for news or social media monitoring, and more. Crawling platforms often provide APIs or user interfaces to enable customized crawls based on specific criteria, such as keyword searches, domain specifications, or content types. This technology is fundamental for search engines to provide up-to-date results and for businesses and researchers to gather and analyze web data at scale.
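The core loop of a crawler (fetch a page, read its content, collect its links to visit next) can be sketched with Python's standard library alone; here the HTML comes from a local string rather than a live fetch:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Gathers href targets from <a> tags, the way a crawler discovers new pages to visit."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# In a real crawler this HTML would come from an HTTP fetch of a queued URL.
page = '<html><body><a href="/about">About</a><a href="https://example.com/">External</a></body></html>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # the newly discovered URLs to enqueue next
```

A full crawler adds a URL frontier, deduplication, politeness delays, and robots.txt handling on top of this loop.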

Here are some practical examples to enhance your understanding of the concept:

1. Common Crawl
  • Overview: Common Crawl is a nonprofit organization that offers a massive archive of web-crawled data. It crawls the web at scale, providing access to petabytes of data, including web pages, links, and metadata, all freely available to the public.
  • Utility for Data Acquisition: Common Crawl is instrumental for researchers, companies, and developers looking to analyze web data at scale without deploying their own crawlers, thus democratizing access to large-scale web data.
2. Bright Data (Formerly Luminati)
  • Overview: Bright Data is recognized as one of the leading web data platforms, offering comprehensive web scraping and data collection solutions. It provides tools for both code-driven and no-code data collection, catering to various needs from simple data extraction to complex data intelligence.
  • Features and Applications: With its robust infrastructure, including a vast proxy network and advanced data collection tools, Bright Data enables users to scrape data across the internet ethically. It supports various use cases, from market research to competitive analysis, ensuring compliance and high-quality data output.
3. Developer Tools: Playwright, Puppeteer and Selenium
  • Overview: For those seeking a more hands-on approach to web scraping, developer tools like Playwright, Puppeteer, and Selenium offer frameworks for automating browser environments. These tools are essential for developers building custom crawlers that programmatically navigate and extract data from web pages.
  • Use in Data Collection: By leveraging these tools, developers can create sophisticated scripts that mimic human navigation patterns, bypass captcha challenges, and extract specific data points from complex web pages, enabling precise and targeted data collection strategies.
4. No-Code Data Collection Platforms
  • Overview: Recognizing the demand for simpler, more accessible data collection methods, several platforms now offer no-code solutions that allow users to scrape and collect web data without writing a single line of code.
  • Impact on Data Acquisition: These platforms lower the barrier to entry for data collection, making it possible for non-technical users to gather data for analysis, market research, or content aggregation, further expanding the pool of individuals and organizations that can leverage web data.

Examples of No-Code Data Collection Platforms

1. ParseHub

  • Description: ParseHub is a powerful and intuitive web scraping tool that allows users to collect data from websites using a point-and-click interface. It can handle websites with JavaScript, redirects, and AJAX.
  • Website: https://www.parsehub.com/

2. WebHarvy

  • Description: WebHarvy is a visual web scraping software that can automatically scrape images, texts, URLs, and emails from websites using a built-in browser. It’s designed for users who prefer a visual approach to data extraction.
  • Website: https://www.webharvy.com/

3. Import.io

  • Description: Import.io offers a more comprehensive suite of data integration tools and web scraping capabilities. It allows no-code data extraction from web pages and can transform and integrate this data with various applications.
  • Website: https://www.import.io/

4. DataMiner

  • Description: DataMiner is a Chrome and Edge browser extension that allows you to scrape data from web pages and into various file formats like Excel, CSV, or Google Sheets. It offers pre-made data scraping templates and a point-and-click interface to select the data you want to extract.
  • Website: Find it on the Chrome Web Store or Microsoft Edge Add-ons

These platforms vary in capabilities, from simple scraping tasks to more complex data extraction and integration functionalities, catering to a wide range of user needs without requiring coding skills.

5. Other Great Web Scraping Tool Options

1. Apify

  • Description: Apify is a cloud-based web scraping and automation platform that utilizes Puppeteer, Playwright, and other technologies to extract data from websites, automate workflows, and integrate with various APIs. It offers a ready-to-use library of actors (scrapers) for everyday tasks and allows users to develop custom solutions.
  • Website: https://apify.com/

2. ScrapingBee

  • Description: ScrapingBee is a web scraping API that handles headless browsers and rotating proxies, allowing users to scrape challenging websites easily. It supports both Puppeteer and Playwright, enabling developers to execute JavaScript-heavy scraping tasks without getting blocked.
  • Website: https://www.scrapingbee.com/

3. Browserless

  • Description: Browserless is a cloud service that provides a scalable and reliable way to run Puppeteer and Playwright scripts in the cloud. It’s designed for developers and businesses needing to automate browsers at scale for web scraping, testing, and automation tasks without managing their browser infrastructure.
  • Website: https://www.browserless.io/

4. Octoparse

  • Description: While Octoparse itself is primarily a no-code web scraping tool, it provides advanced options that allow integration with custom scripts, potentially incorporating Puppeteer or Playwright for specific data extraction tasks, especially when dealing with websites that require interaction or execute complex JavaScript.
  • Website: https://www.octoparse.com/

5. ZenRows

  • Description: ZenRows is a web scraping API that simplifies the process of extracting web data and handling proxies, browsers, and CAPTCHAs. It supports Puppeteer and Playwright, making it easier for developers to scrape data from modern web applications that rely heavily on JavaScript.
  • Website: https://www.zenrows.com/
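Several of the services above (ScrapingBee, ZenRows, Browserless) sell exactly this kind of plumbing: proxy rotation and retries with exponential backoff so that scraping jobs survive blocks and transient failures. A sketch of that scheduling logic is below; the proxy addresses and timing parameters are made-up illustrations, and no network requests are actually sent:

```python
# Sketch of the proxy-rotation and retry/backoff logic that hosted scraping
# APIs (ScrapingBee, ZenRows, etc.) handle for you. The proxy addresses and
# timing parameters here are hypothetical, and no requests are sent.
from itertools import cycle

def backoff_schedule(attempts, base=1.0, factor=2.0, cap=30.0):
    """Exponential backoff delays in seconds, capped at `cap`."""
    return [min(base * factor**i, cap) for i in range(attempts)]

def plan_attempts(proxies, attempts):
    """Pair each retry attempt with the next proxy in rotation and its delay."""
    rotation = cycle(proxies)
    delays = backoff_schedule(attempts)
    return [(next(rotation), delay) for delay in delays]

proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # hypothetical
plan = plan_attempts(proxies, attempts=5)
for proxy, delay in plan:
    print(f"try via {proxy}, wait {delay:.0f}s on failure")
```

Rotating the exit proxy on every attempt while growing the wait between attempts is the standard pattern for staying under rate limits; the hosted APIs layer CAPTCHA solving and headless-browser rendering on top of it.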

Looking to the Future

As AI technologies like ChatGPT and DALL-E 3 continue to evolve, powered by vast amounts of data, researchers have raised concerns about a potential shortage of high-quality training data by 2026. This scarcity could impede the growth and effectiveness of AI systems, given the need for large, high-quality datasets to develop accurate and sophisticated algorithms. High-quality data is crucial for avoiding biases and inaccuracies in AI outputs, as seen in cases where AI has replicated undesirable behaviors from low-quality training sources. To address this impending data shortage, the industry could turn to improved AI algorithms to better use existing data, generate synthetic data, and explore new sources of high-quality content, including negotiating with content owners for access to previously untapped resources. These strategies aim to sustain the development of AI technologies and mitigate ethical concerns by potentially offering compensation for the use of creators’ content.

Looking to the future, the importance of data, likened to the new black gold, is poised to grow exponentially, heralding a future prosperous with innovation and opportunity. Anticipated advancements in data processing technologies, such as quantum and edge computing, promise to enhance the efficiency and accessibility of data analytics, transforming the landscape of information analysis. The emergence of synthetic data stands out as a groundbreaking solution to navigate privacy concerns, enabling the development of AI and machine learning without compromising individual privacy. These innovations indicate a horizon brimming with potential for transformative changes in collecting, analyzing, and utilizing data.

However, the true challenge and opportunity lie in democratizing access to this vast wealth of information, ensuring that the benefits of data are not confined to a select few but are shared across the global community. Developing equitable data-sharing models and open data initiatives will be crucial in leveling the playing field, offering startups, researchers, and underrepresented communities the chance to participate in and contribute to the data-driven revolution. As we navigate this promising yet complex future, prioritizing ethical considerations, transparency, and the responsible use of data will be paramount in fostering an environment where innovation and opportunity can flourish for all, effectively addressing the challenges of data scarcity and shaping a future enriched by data-driven advancements.

Conclusion

The elevation of data to the status of the most valuable asset in technology marks a pivotal transformation in our global economy and society. This shift reflects a more profound change in our collective priorities, recognizing data’s immense potential for catalyzing innovation, driving economic expansion, and solving complex challenges. However, with great power comes great responsibility. As we harness this new black gold, the ethical considerations and societal impacts of our data-driven endeavors become increasingly significant. Ensuring that the benefits of data are equitably distributed and that privacy, security, and ethical use are prioritized is essential for fostering trust and sustainability in technological advancement.

We encounter unparalleled opportunities and profound challenges in navigating the future technology landscape powered by the vast data reserves. The potential for data to improve lives, streamline industries, and open new frontiers of knowledge is immense. Yet, this potential must be balanced with vigilance against the risks of misuse, bias, and inequality arising from unchecked data proliferation. Crafting policies, frameworks, and technologies that safeguard individual rights while promoting innovation will be crucial in realizing the full promise of data. Collaborative efforts among governments, businesses, and civil society to establish norms and standards for data use can help ensure that technological progress serves the broader interests of humanity.

As we look to the future, the journey of data as the cornerstone of technological advancement is only beginning. Exploring this new black gold will continue to reshape our world, offering pathways to previously unimaginable possibilities. Yet, the true measure of our success in this endeavor will not be the quantity of data collected or the sophistication of the algorithms developed but how well we leverage this resource to enhance human well-being, foster sustainable development, and bridge the divides that separate us. In this pursuit, our collective creativity, ethical commitment, and collaborative spirit will be our most valuable assets, guiding us toward a future where technology, powered by data, benefits all of humanity.

That’s it for today!

Sources

https://www.frontiersin.org/articles/10.3389/fsurg.2022.862322/full

Researchers warn we could run out of data to train AI by 2026. What then? (theconversation.com)

The Business Case for AI Data Analytics in 2024 – YouTube

OpenAI Asks Public for More Data to Train Its AI Models (aibusiness.com)

Navigating the New Era: Development of Systems Guided by Generative AI

Generative Artificial Intelligence (AI) stands at the forefront of technological innovation, pushing the boundaries of what machines can achieve. It learns from existing artifacts to generate new, realistic creations at scale, preserving the essence of the original data without merely replicating it. The spectrum of novel content that Generative AI can produce spans images, videos, music, speech, text, software code, and product designs. The backbone of Generative AI lies in foundation models, which are trained on vast datasets and further fine-tuned for specific tasks. Although the mathematics and computing power required are immense, at their core these models remain prediction algorithms.

Generative AI is gradually becoming a household name, thanks to platforms like ChatGPT by OpenAI, which exhibits human-like interactions, and DALL-E, which generates images from text descriptions. As per Gartner, Generative AI is on the trajectory to become a general-purpose technology with an impact echoing the likes of steam engines, electricity, and the internet.

What does Gartner predict for the future of generative AI use?

Generative AI is primed to make an increasingly strong impact on enterprises over the next five years. Gartner predicts that:

By 2024, 40% of enterprise applications will have embedded conversational AI, up from less than 5% in 2020.

By 2025, 30% of enterprises will have implemented an AI-augmented development and testing strategy, up from 5% in 2021.

By 2026, generative design AI will automate 60% of the design effort for new websites and mobile apps.

By 2026, over 100 million humans will engage robocolleagues (synthetic virtual colleagues) to contribute to their work.

By 2027, nearly 15% of new applications will be automatically generated by AI without a human in the loop. This is not happening at all today.

Which sectors are being impacted by the development of systems with Generative AI?

  1. Healthcare:
    • Drug Discovery: Generative AI is revolutionizing the pharmaceutical landscape by expediting the drug discovery process. It can predict new compounds’ effectiveness and potential side effects, significantly reducing the time and costs of bringing a new drug to market. Moreover, Generative AI can help create synthetic molecular structures that could be groundbreaking cures for various diseases.
    • Medical Imaging and Diagnosis: Generative AI also plays a pivotal role in medical imaging and diagnostics. It can generate synthetic medical images to augment datasets, which is invaluable for training machine learning models, especially when real-world data is scarce or sensitive. Besides, it can assist in detecting and diagnosing diseases by analyzing medical images.
  2. Automotive and Aerospace:
    • Generative Design: In industries like automotive and aerospace, generative design powered by AI is a game-changer. It allows engineers to input design goals and constraints into a generative design software, which then explores all possible permutations of a solution, quickly generating design alternatives. It tests and learns from each iteration what works and what doesn’t to meet the design objectives.
    • Simulation and Testing: Generative AI can create realistic simulation environments, which are crucial for testing and validating autonomous driving systems or new aerospace technologies before they are deployed in real-world scenarios.
  3. Finance:
    • Risk Analysis and Fraud Detection: By modeling complex financial systems, Generative AI helps in risk analysis and fraud detection. It can generate synthetic data to stress-test various scenarios, which is imperative for financial institutions to remain resilient against economic uncertainties.
    • Algorithmic Trading: Generative AI can also be harnessed to develop sophisticated algorithmic trading strategies. It can generate predictive models to identify trading opportunities by analyzing vast financial data.
  4. Marketing:
    • Content Generation: The marketing realm is being reshaped with Generative AI’s ability to create compelling content. From drafting initial copy to generating personalized advertising, it’s enabling marketers to engage with their audience on a new level.
    • Customer Insights: Generative AI can dive into vast datasets to unearth insights into customer behavior and preferences, which can be harnessed to tailor marketing strategies effectively.
  5. Intellectual Property (IP):
    • Automated Patent Analysis: Generative AI can automate the analysis of vast patent datasets, helping to identify patent trends, assess the novelty of inventions, and even predict future technological advancements. This automated analysis can significantly speed up the patent granting process and help organizations stay ahead in the IP landscape.
    • Design Generation: In the domain of design patents, Generative AI can assist in creating novel designs or variations of existing designs at an unimaginable pace. However, this raises critical questions about the ownership and originality of the generated designs, nudging the IP sector to redefine its boundaries.
  6. Legal:
    • Legal Research and Document Review: Generative AI can automate legal research and document review tasks. By quickly analyzing vast amounts of legal texts, case laws, and precedents, it can provide lawyers with relevant information, saving precious time and resources.
    • Contract Generation and Analysis: The creation and analysis of legal contracts are other areas where Generative AI is making a significant impact. It can generate contract drafts based on the input parameters and analyze existing contracts to ensure compliance with the requisite legal standards.
    • Predictive Analysis: Moreover, Generative AI can be used for predictive analysis in legal scenarios, helping forecast legal dispute outcomes based on historical data. This could provide legal practitioners with valuable insights to strategize their cases better.
    • Legal Chatbots: Generative AI-powered legal chatbots can provide initial legal advice based on the query fed to them. They can understand the legal issue and provide a basic understanding of the legal stance, aiding in better client engagement and filtering.

Each of these sectors exemplifies the profound impact and the boundless potential of Generative AI. By automating and augmenting various processes, Generative AI is not only driving efficiency and cost-savings but is also opening doors to new possibilities that were once deemed unattainable.
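The synthetic-data idea surfaces twice above: for augmenting scarce medical-imaging datasets and for stress-testing financial scenarios. As a toy illustration of the tabular case, a seeded generator might look like the sketch below; the schema, distribution parameters, and 2% fraud rate are all invented for the demo, whereas a real generator would be fit to statistics of actual data:

```python
# Illustrative sketch of tabular synthetic-data generation, the technique the
# Finance and Healthcare examples above rely on for stress tests and dataset
# augmentation. The schema, distributions, and fraud rate are hypothetical.
import random

def synth_transactions(n, seed=42):
    """Generate n synthetic transaction records with plausible-looking fields."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    records = []
    for i in range(n):
        records.append({
            "id": i,
            "amount": round(rng.lognormvariate(mu=3.0, sigma=1.0), 2),
            "channel": rng.choice(["web", "branch", "mobile"]),
            "is_fraud": rng.random() < 0.02,  # assumed 2% base rate
        })
    return records

data = synth_transactions(1000)
fraud_rate = sum(r["is_fraud"] for r in data) / len(data)
print(f"{len(data)} records, fraud rate = {fraud_rate:.1%}")
```

Because the generator is seeded, the same dataset can be regenerated on demand, which matters when synthetic records stand in for sensitive real data in tests and model training.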

The Developer’s New Playground

In the wake of a technological renaissance, where artificial intelligence (AI) is the linchpin of modern innovation, the traditional silhouette of a developer’s career is undergoing a remarkable transformation. The advent of AI-infused systems is not just a fleeting trend but a seismic shift, nudging developers into a new epoch where their roles transcend the conventional boundaries of code and algorithms. This transition is not merely about adapting to new tools or languages but embracing a holistic metamorphosis, redefining what it means to be a developer. Here, we delve into the kaleidoscope of changes, painting the developer’s journey with new shades of challenges, learning, and opportunities.

Morphing Roles and Skillsets:

1. From Coders to Solution Architects: The new era nudges developers from mere coders to solution architects, orchestrating AI-driven solutions that address real-world problems.
2. Interdisciplinary Proficiency: A developer’s role now demands a confluence of skills, including data science, machine learning, and understanding of domain-specific challenges.
3. Ethical and Responsible AI Development: Developers are at the helm of ensuring that AI systems are built with a framework of ethics, transparency, and accountability.

With Generative AI, developers are stepping into an expansive playground. They can now focus on crafting high-level objectives while the AI handles the detailed design. This speeds up the development process and opens up new avenues for creativity and innovation.

Continuous Learning: The New Norm

In the fast-paced realm of technology, staying updated is not a choice but a necessity. This truth resonates even louder in Generative Artificial Intelligence (Generative AI), a domain continuously evolving, expanding, and surprising us with its potential. For developers, riding the wave of Generative AI is not about catching up but constantly sailing along, learning, and adapting. As Generative AI continues to redefine the contours of what’s possible in system development, a culture of continuous learning emerges as the new norm for developers. This isn’t merely about acquiring new skills; it’s about fostering a mindset of perpetual growth and curiosity.

Why Continuous Learning?

1. Staying Relevant:
  In a rapidly changing field, staying updated with the latest advancements is crucial for developers to remain relevant and competitive in their careers.
2. Harnessing Full Potential:
  Continuous learning enables developers to harness the full potential of Generative AI, ensuring they can leverage the latest features and capabilities in their projects.
3. Problem-Solving:
  With each new learning, developers expand their problem-solving toolkit, equipping themselves to tackle complex challenges innovatively.
4. Ethical and Responsible AI Development:
  As Generative AI advances, so do the ethical considerations surrounding its use. Continuous learning is imperative to ensure responsible and ethical AI development.

The Path of Continuous Learning:

1. Online Courses and Certifications:
  Numerous online platforms offer courses and certifications on Generative AI and related technologies, facilitating continuous learning.
2. Community Engagement:
  Engaging with the AI community, participating in forums, and attending conferences are excellent ways to learn from peers and stay updated.
3. Practical Application:
  Applying learned concepts in real-world projects is a powerful way to reinforce learning and gain practical experience.
4. Reading and Research:
  Regularly reading research papers, blogs, and articles in the domain can provide insights into the latest advancements and best practices.

Conclusion

Generative AI transcends the conventional role of a tool; it emerges as a formidable collaborator, amplifying developers’ creative and problem-solving prowess. The journey with Generative AI is akin to navigating through an expansive realm of innovation, where each step forward unveils new horizons of possibilities. As elucidated, the rapid evolution of Generative AI beckons a culture of continuous learning among developers, a requisite not merely to remain relevant but to excel and innovate in this dynamic landscape.

As Generative AI continues to percolate through various sectors, notably the intellectual property and legal domains, its harmonization with modern development systems is not a fleeting trend but a profound shift. Understanding and adapting to Generative AI isn’t just beneficial; it’s indispensable for developers who want to harness this technology’s burgeoning potential fully. The narrative is not about optional adaptation but essential evolution to foster a synergistic alliance with Generative AI.

The infusion of Generative AI in modern development systems isn’t merely a technical enhancement; it’s a paradigm shift towards a more collaborative, innovative, and continuously evolving development ecosystem. As developers, embracing this shift is synonymous with stepping into a future of endless exploration, innovation, and growth. The ripple effects of this fusion are significant, reshaping not just how systems are developed but how developers evolve in their careers, continuously learn, and contribute to the broader narrative of technological advancement.

As Generative AI finds its footing in more sectors, the symbiotic relationship between it and developers will be the linchpin for unlocking new dimensions of innovation, solving complex problems, and creating value in unprecedented ways. Hence, understanding and adapting to Generative AI is not a mere advantage; it’s a cornerstone for thriving in modern development landscapes that are increasingly intertwined with intelligent and creative computational counterparts.

That’s it for today!

Sources

Generative AI: What Is It, Tools, Models, Applications and Use Cases (gartner.com)