Implementing Data Governance in Microsoft Fabric: A Step-by-Step Guide

Information is arguably an organization’s most valuable asset in today’s data-driven world. However, without proper management, this asset can quickly become a liability. Microsoft Fabric, a revolutionary unified analytics platform, integrates everything from data engineering and data science to data warehousing and business intelligence into a single, SaaS-based environment. It provides powerful tools to store, process, analyze, and visualize vast amounts of data. But with great power comes great responsibility. To maintain trust, ensure security, uphold data quality, and meet ever-increasing compliance demands, implementing a robust data governance framework within Fabric isn’t just recommended—it’s essential.

Effective data governance ensures that data remains accurate, secure, consistent, and usable throughout its entire lifecycle, aligning technical capabilities with strategic business goals and stringent regulatory requirements like GDPR, HIPAA, or CCPA. Within the Fabric ecosystem, this translates to leveraging its built-in governance features and its seamless integration with Microsoft Purview, Microsoft’s comprehensive data governance and compliance suite. The goal is to effectively manage and protect sensitive information while empowering users, from data engineers and analysts to business users and compliance officers, to confidently discover, access, and derive value from data within well-defined, secure guardrails.

A well-designed governance plan in Fabric strikes a critical balance between enabling user productivity and innovation and enforcing necessary controls for compliance and risk mitigation. It’s about establishing clear policies, defining roles and responsibilities, and implementing consistent processes so that, as the adage goes, “the right people can take the right actions with the right data at the right time”. This guide provides a practical, step-by-step approach to implementing such a framework within Microsoft Fabric, leveraging its native capabilities and Purview integration to build a governed, trustworthy data estate.

The Critical Importance of Data Governance

Data governance is more than just an IT buzzword or a compliance checkbox; it is a fundamental strategic imperative for any organization looking to leverage its data assets effectively and responsibly. The need for robust governance becomes even more pronounced in the context of a powerful, unified platform like Microsoft Fabric, which brings together diverse data workloads and user personas. Implementing strong data governance practices yields numerous critical benefits:

  • Ensuring Data Quality and Consistency: Governance establishes standards and processes for data creation, maintenance, and usage, leading to more accurate, reliable, and consistent data across the organization. This is crucial for trustworthy analytics and informed decision-making. Poor data quality can lead to flawed insights, operational inefficiencies, and loss of credibility.
  • Enhancing Data Security and Protection: A core function of governance is to protect sensitive data from unauthorized access, breaches, or misuse. By defining access controls, implementing sensitivity labeling (using tools like Microsoft Purview Information Protection), and enforcing security policies, organizations can safeguard confidential information, protect intellectual property, and maintain customer privacy.
  • Meeting Regulatory Compliance Requirements: Organizations operate under a complex web of industry regulations and data privacy laws (such as GDPR, CCPA, HIPAA, SOX, etc.). Data governance provides the framework, controls, and audit trails necessary to demonstrate compliance, avoid hefty fines, and mitigate legal risks. Features like data lineage and auditing in Fabric, often powered by Purview, are essential.
  • Improving Data Discoverability and Usability: A well-governed data estate makes it easier for users to find the data they need. Features like the OneLake data hub, data catalogs, business glossaries, endorsements (certifying or promoting assets), and descriptive metadata help users quickly locate relevant, trustworthy data, fostering reuse and reducing redundant data preparation efforts.
  • Building Trust and Confidence: When users know that data is well-managed, secure, and accurate, they have greater confidence in the insights derived from it. This trust is foundational for fostering a data-driven culture where decisions are based on reliable evidence.
  • Optimizing Operational Efficiency: Governance helps streamline data-related processes, reduce data duplication, clarify ownership, and improve team collaboration. This leads to increased efficiency, reduced costs for managing poor-quality or redundant data, and faster time-to-insight.
  • Enabling Scalability and Innovation: While governance involves controls, it also provides the necessary structure to manage data effectively as volumes and complexity grow. A solid governance foundation allows organizations to innovate confidently, knowing their data practices are sound and scalable.

Data governance transforms data from a potential risk into a reliable, strategic asset, enabling organizations to maximize their value while minimizing associated risks within the Microsoft Fabric environment.

An Overview of Microsoft Fabric

Understanding the platform itself is helpful before diving into the specifics of governance implementation. Microsoft Fabric represents a significant evolution in the analytics landscape, offering an end-to-end, unified platform delivered as a Software-as-a-Service (SaaS) solution. It aims to simplify analytics for organizations by combining disparate data tools and services into a single, cohesive environment built around a central data lake called OneLake.

Fabric integrates various data and analytics workloads, often referred to as “experiences,” which traditionally required separate, usually complex, integrations:

  • Data Factory: Provides data integration capabilities for ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes, enabling data movement and transformation at scale.
  • Synapse Data Engineering: A Spark-based platform for large-scale data transformation and preparation, used primarily through notebooks.
  • Synapse Data Science: Provides an end-to-end workflow for data scientists to build, deploy, and manage machine learning models.
  • Synapse Data Warehousing: Delivers a next-generation SQL engine for traditional data warehousing workloads, offering high performance over open data formats.
  • Synapse Real-Time Analytics: Enables real-time analysis of data streaming from various sources, such as IoT devices and logs.
  • Power BI: The well-established business intelligence and visualization service, fully integrated for reporting and analytics.
  • Data Activator: A no-code experience for monitoring data and triggering actions based on detected patterns or conditions.

Fabric also includes shortcuts, which allow your organization to easily share data between users and applications without unnecessarily moving and duplicating information. When teams work independently in separate workspaces, shortcuts let you combine data across different business groups and domains into a virtual data product tailored to a user’s specific needs.

A shortcut is a reference to data stored in another file location. These locations can be within the same workspace or across different workspaces, within OneLake or external to OneLake in ADLS, S3, or Dataverse, with more target locations coming soon. No matter the location, shortcuts make files and folders appear as if they are stored locally. For more information on how to use shortcuts, see OneLake shortcuts.
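
For teams that provision shortcuts programmatically rather than through the UI, the Fabric REST API exposes a shortcuts endpoint. The sketch below is a minimal illustration; the endpoint path, payload shape, IDs, and token handling are assumptions to verify against the OneLake shortcuts API reference.

Python
# Minimal sketch: creating a OneLake shortcut in a lakehouse via the Fabric
# REST API. All GUIDs and the token are placeholders; verify the endpoint and
# payload against the current OneLake shortcuts API reference.
import requests

token = "<azure-ad-access-token>"
workspace_id = "<workspace-guid>"   # workspace containing the lakehouse
lakehouse_id = "<lakehouse-guid>"   # item the shortcut is created in

payload = {
    "path": "Files",        # folder inside the lakehouse for the new shortcut
    "name": "SalesData",    # shortcut name as it appears in OneLake
    "target": {
        "oneLake": {        # source data the shortcut points to (no copy made)
            "workspaceId": "<source-workspace-guid>",
            "itemId": "<source-lakehouse-guid>",
            "path": "Tables/Sales",
        }
    },
}

resp = requests.post(
    f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json())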

Underpinning all these experiences is OneLake, Fabric’s built-in, tenant-wide data lake. OneLake eliminates data silos by providing a single, unified storage system for all data within Fabric, regardless of which experience created or uses it. It is built on Azure Data Lake Storage Gen2 but adds shortcuts (allowing data to be referenced without being moved or duplicated) and a unified namespace, simplifying data management and access.

This unified architecture has profound implications for governance. By centralizing data storage (OneLake) and providing a familiar administrative interface (Fabric Admin Portal), Fabric facilitates the application of consistent governance policies, security controls, and monitoring across the entire analytics lifecycle. Features like sensitivity labels and lineage can often propagate automatically across different Fabric items, simplifying the task of governing a complex data estate. Understanding this integrated nature is key to effectively implementing governance within the platform.

Understanding Microsoft Purview: The Governance Foundation

While Microsoft Fabric provides the unified analytics platform, Microsoft Purview is the overarching data governance, risk, and compliance solution that integrates deeply with Fabric to manage and protect the entire data estate. Understanding Purview’s role is crucial for implementing effective governance in Fabric.

Microsoft Purview is a family of solutions designed to help organizations govern, protect, and manage data across their entire landscape, including Microsoft 365, on-premises systems, multi-cloud environments, and SaaS applications like Fabric. Its key capabilities relevant to Fabric governance include:

  • Unified Data Catalog: Purview automatically discovers and catalogs Fabric items (like lakehouses, warehouses, datasets, reports) alongside other data assets. It creates an up-to-date map of the data estate, enabling users to easily find and understand data through search, browsing, and business glossary terms.
  • Data Classification and Sensitivity Labels: Through integration with Microsoft Purview Information Protection, Purview allows organizations to define sensitivity labels (e.g., Confidential, PII) and apply them consistently across Fabric items. This classification helps identify sensitive data and drives protection policies.
  • End-to-End Data Lineage: Purview provides visualization of data lineage, showing how data flows and transforms from its source through various Fabric processes (e.g., Data Factory pipelines, notebooks) down to Power BI reports. This is vital for impact analysis, troubleshooting, and demonstrating compliance.
  • Data Loss Prevention (DLP): Purview DLP policies can be configured (currently primarily for Power BI semantic models within Fabric) to detect sensitive information based on classifications or patterns (like credit card numbers) and prevent its unauthorized sharing or exfiltration, providing alerts and policy tips.
  • Auditing: All user and administrative activities within Fabric are logged and made available through Microsoft Purview Audit, providing a comprehensive trail for security monitoring and compliance investigations.
  • Purview Hub in Fabric: This centralized page within the Fabric experience provides administrators and governance stakeholders with insights into their Fabric data estate, including sensitivity labeling coverage, endorsement status, and a gateway to the broader Purview governance portal.

Purview is the central governance plane that overlays Fabric (and other data sources), providing the tools to define policies, classify data, track lineage, enforce protection, and consistently monitor activities. The seamless integration ensures that as data moves and transforms within Fabric, the governance context (like sensitivity labels and lineage) is maintained, enabling organizations to build a truly governed and trustworthy analytics environment.

https://learn.microsoft.com/en-us/purview/data-governance-overview

Step-by-Step Process for Implementing Data Governance in Microsoft Fabric

Implementing data governance in Microsoft Fabric is a phased process that involves defining policies, configuring technical controls, assigning responsibilities, and establishing ongoing monitoring. Here’s a practical step-by-step guide:

Step 1: Define Your Governance Policies and Framework

Before configuring any tools, establish the foundation – your governance framework. This involves defining the rules, standards, and responsibilities that will guide data handling within Fabric.

  • Identify Stakeholders and Requirements: Assemble a cross-functional team including representatives from IT, data management, legal, compliance, and key business units. Collaboratively identify all applicable external regulations (e.g., GDPR, HIPAA, or CCPA) and internal business requirements (e.g., data quality standards, retention policies, ethical use guidelines). Understanding these requirements is crucial for tailoring your policies.
  • Develop Data Classification Policies: Define clear data sensitivity levels (e.g., Public, Internal, Confidential, Highly Restricted). Map these levels to Microsoft Purview Information Protection sensitivity labels. Establish clear policies detailing how data in each classification level must be handled regarding access, sharing, encryption, retention, and disposal. For example, a policy might mandate that all data classified as “Highly Restricted” be encrypted and accessible only to specific roles. https://learn.microsoft.com/en-us/purview/sensitivity-labels
  • Configure Tenant Settings via Admin Portal: Fabric administrators should configure tenant-wide governance settings in the Fabric Admin Portal. This includes defining who can create workspaces, setting default sharing behaviors, enabling auditing, configuring capacity settings, and potentially restricting specific Fabric experiences. Many settings can be delegated to domain or capacity admins, where appropriate, for more granular control. Consider licensing requirements for advanced Purview features like automated labeling or DLP. https://learn.microsoft.com/en-us/fabric/admin/about-tenant-settings
  • Document and Communicate: Document all governance policies, standards, and procedures. Make this documentation easily accessible to all Fabric users. Communicate the policies effectively, explaining their rationale and clarifying user responsibilities. Assign clear accountability for policy enforcement, often involving data stewards, data owners, and workspace administrators.

Step 2: Establish Roles and Access Controls (RBAC)

With policies defined, implement Role-Based Access Control (RBAC) to enforce them, assigning Fabric workspace roles (Admin, Member, Contributor, Viewer) and item-level permissions according to the principle of least privilege.
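
Role assignments can be made in the portal, but scripting them makes provisioning repeatable and auditable. Below is a minimal sketch using the Power BI REST API, where Fabric workspaces surface as “groups”; the app registration, IDs, and user are placeholders, and the call should be verified against the current API reference.

Python
# Minimal sketch: granting a user the Viewer role on a Fabric workspace via
# the Power BI REST API (Groups - Add Group User). All IDs are placeholders.
import msal
import requests

# Acquire an app-only token for the Power BI API (placeholder app registration)
app = msal.ConfidentialClientApplication(
    client_id="<app-client-id>",
    client_credential="<app-secret>",
    authority="https://login.microsoftonline.com/<tenant-id>",
)
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)

workspace_id = "<workspace-guid>"
body = {
    "identifier": "analyst@contoso.com",  # user, group, or service principal
    "principalType": "User",
    "groupUserAccessRight": "Viewer",     # Admin | Member | Contributor | Viewer
}

resp = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/users",
    headers={"Authorization": f"Bearer {token['access_token']}"},
    json=body,
)
resp.raise_for_status()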

Step 3: Configure Workspaces and Domains

Organize your Fabric environment logically to support governance.

  • Structure Domains: Group workspaces into logical domains, typically aligned with business units or subject areas (e.g., Finance, Marketing, Product Analytics). This facilitates delegated administration and helps users discover relevant data. https://learn.microsoft.com/en-us/fabric/governance/domains
  • Organize Workspaces: Within domains, organize workspaces based on purpose (e.g., project, team) or environment (Development, Test, Production). Use clear naming conventions and descriptions (a naming-convention audit sketch follows this list). Assign workspaces to the appropriate domain. https://learn.microsoft.com/en-us/fabric/fundamentals/workspaces
  • Apply Workspace Settings: Configure settings within each workspace, such as contact lists, license modes (Pro, PPU, Fabric capacity), and connections to resources like Git for version control, aligning them with your governance policies.
  • Consider Lifecycle Management: Use separate workspaces and potentially Fabric deployment pipelines to manage content promotion from development through testing to production, ensuring only validated assets reach end-users. https://learn.microsoft.com/en-us/fabric/cicd/deployment-pipelines/understand-the-deployment-process?tabs=new-ui
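
As referenced above, naming conventions are easier to sustain when they are audited automatically. Here is a minimal sketch that lists workspaces through the Power BI REST API and flags names that don’t match a pattern; the pattern shown is a hypothetical standard, not a Fabric requirement, and token acquisition is omitted for brevity.

Python
# Minimal sketch: flag workspaces whose names don't follow a (hypothetical)
# "<DOMAIN>-<ENV>-<Name>" convention, using the Power BI REST API.
import re
import requests

token = "<access-token>"  # acquired as in the RBAC example above

resp = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Hypothetical standard: e.g. FIN-PROD-Sales, MKT-DEV-CampaignAnalytics
pattern = re.compile(r"^(FIN|MKT|HR)-(DEV|TEST|PROD)-\w+$")

for ws in resp.json()["value"]:
    if not pattern.match(ws["name"]):
        print(f"Non-compliant workspace name: {ws['name']} ({ws['id']})")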

Step 4: Implement Data Protection and Security Measures

Actively protect your data assets using built-in and integrated tools.

  • Apply Sensitivity Labels: Implement the data classification policy by applying Microsoft Purview Information Protection sensitivity labels to Fabric items (datasets, reports, lakehouses, etc.). Use a combination of manual labeling by users, default labeling on workspaces or items, and automated labeling based on sensitive information types detected by Purview scanners. Ensure label inheritance policies are configured appropriately (a scripted labeling example follows this list). https://learn.microsoft.com/en-us/power-bi/enterprise/service-security-enable-data-sensitivity-labels
  • Configure Data Loss Prevention (DLP) Policies: Define and enable Microsoft Purview DLP policies specifically for Power BI (and potentially other Fabric endpoints as capabilities expand) to detect and prevent the inappropriate sharing or exfiltration of sensitive data identified by sensitivity labels. (Note: Requires specific Purview licensing.) https://learn.microsoft.com/en-us/fabric/governance/data-loss-prevention-configure
  • Leverage Encryption: Understand and utilize Fabric’s encryption capabilities, including encryption at rest (often managed by the platform) and potentially customer-managed keys (CMK) for enhanced control over encryption if required. https://learn.microsoft.com/en-us/fabric/security/security-scenario
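
As referenced in the labeling bullet above, labels can also be applied in bulk programmatically. The sketch below uses the Power BI admin information protection API; the endpoint, payload shape, and GUIDs are assumptions to verify against the current API reference and your own Purview label taxonomy.

Python
# Minimal sketch: bulk-applying a Purview sensitivity label to Power BI items
# via the admin setLabels endpoint. Payload shape and GUIDs are assumptions;
# verify against the current API reference.
import requests

token = "<admin-access-token>"
body = {
    "artifacts": {
        "datasets": [{"id": "<dataset-guid>"}],
        "reports": [{"id": "<report-guid>"}],
    },
    "labelId": "<sensitivity-label-guid>",  # label defined in Microsoft Purview
}

resp = requests.post(
    "https://api.powerbi.com/v1.0/myorg/admin/informationprotection/setLabels",
    headers={"Authorization": f"Bearer {token}"},
    json=body,
)
resp.raise_for_status()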

Step 5: Enable Monitoring and Auditing

Visibility into data usage and governance activities is crucial.
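
As one concrete option, user and admin activity can be pulled programmatically for monitoring; the sketch below uses the Power BI admin activity events API (the same events also surface in Microsoft Purview Audit). The token and date range are placeholders, and the API serves results in pages through a continuation URI.

Python
# Minimal sketch: collect one UTC day of activity events via the admin API.
import requests

token = "<admin-access-token>"
url = (
    "https://api.powerbi.com/v1.0/myorg/admin/activityevents"
    "?startDateTime='2025-01-01T00:00:00'&endDateTime='2025-01-01T23:59:59'"
)

events = []
while url:
    resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
    resp.raise_for_status()
    data = resp.json()
    events.extend(data.get("activityEventEntities", []))
    url = data.get("continuationUri")  # None once all pages are consumed

print(f"Collected {len(events)} audit events")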

Step 6: Foster Data Discovery, Trust, and Reuse

Governance should also empower users by making trustworthy data easily accessible.

Step 7: Monitor, Iterate, and Optimize

Data governance is not a one-time project but an ongoing process.

  • Regularly Review and Audit: Periodically review governance policies, access controls, label usage, and audit logs to ensure effectiveness and identify areas for improvement. https://learn.microsoft.com/en-us/fabric/governance/governance-compliance-overview
  • Gather Feedback: Solicit feedback from users and stakeholders on the governance processes and tools.
  • Adapt and Update: Update policies and configurations based on audit findings, user feedback, changing regulations, and evolving business needs. Stay informed about new Fabric and Purview governance features.

By following these steps, organizations can establish a comprehensive and practical data governance framework within Microsoft Fabric, enabling them to harness the full power of the platform while maintaining control, security, and compliance.

Real-World Examples: Data Governance in Action

The principles and steps outlined above are not just theoretical; organizations are actively implementing robust data governance frameworks using Microsoft Fabric and Purview to overcome challenges and drive value. Let’s look at a couple of examples:

1. Microsoft’s Internal Transformation:

Microsoft itself faced significant hurdles with its vast and complex data estate. Data was siloed across various business units and managed inconsistently, making it difficult to gain a unified enterprise view. Governance was often perceived as a bottleneck, hindering the pace of digital transformation. To address this, Microsoft embarked on its own data transformation journey, leveraging its own tools.

Their strategy involved building an enterprise data platform centered around Microsoft Fabric as the unifying analytics foundation and Microsoft Purview for governance. Fabric helped break down silos by providing a common platform (including OneLake) for data integration and analytics across diverse sources. Purview was then layered on top to enable responsible data democratization. This meant implementing controls like a shared data catalog and consistent policies, not to restrict access arbitrarily, but to enable broader, secure access to trustworthy data. A key cultural shift was viewing governance as an accelerator for transformation, facilitated by the unified data strategy and strong leadership alignment. The outcome is a more agile, regulated, and business-focused data environment that fuels faster decision-making and innovation.

2. Leading Financial Institution:

A leading bank operating in a highly regulated industry revolutionized its data governance with Microsoft Purview. While specific challenges aren’t detailed in the summary, typical banking concerns include operational efficiency, stringent compliance requirements (like GDPR), data security, and preventing sensitive data loss.

By implementing Purview, the bank achieved significant improvements. Operationally, automated data discovery and a centralized view allowed business users to find information faster and reduced manual effort in reporting. From a compliance perspective, Purview provided centralized metrics for monitoring the compliance posture and automated processes for classifying and tagging data according to regulations, strengthening overall security. Furthermore, implementing Data Loss Prevention (DLP) rules based on data sensitivity helped safeguard critical information and prevent unauthorized access or sharing. Purview acted as a unified platform, enhancing efficiency, visibility, security, and control over the bank’s data assets.

These examples illustrate how organizations, facing everyday challenges like data silos, compliance pressures, and the need for agility, are successfully using Microsoft Fabric and Purview to establish effective data governance. They highlight the importance of a unified data strategy, the role of tools in automating and centralizing controls, and the cultural shift towards viewing governance as an enabler of business value.

Conclusion

Microsoft Fabric offers a robust, unified platform for end-to-end analytics, but realizing its full potential requires a deliberate and comprehensive approach to data governance. As we’ve explored, implementing governance in Fabric is not merely about restricting access; it’s about establishing a framework that ensures data quality, security, compliance, and usability, fostering trust and enabling confident, data-driven decision-making across the organization.

The real-world examples, from Microsoft’s internal transformation to implementations in regulated industries like finance, demonstrate that these are not just theoretical concepts. Organizations are actively leveraging Fabric’s unified foundation and Purview’s comprehensive governance capabilities to overcome tangible challenges like data silos, inconsistent management, compliance burdens, and operational inefficiencies.

By integrating Fabric’s built-in features—such as the Admin Portal, domains, workspaces, RBAC, endorsement, and lineage—with the advanced capabilities of Microsoft Purview—including Information Protection sensitivity labels, Data Loss Prevention, auditing, and the unified data catalog—organizations can create a robust governance posture tailored to their specific needs.

The outlined step-by-step process provides a roadmap, but the journey requires more than technical implementation. Success hinges on several key factors, reinforced by real-world experience:

Key Recommendations for Success:

  1. Strategic Alignment and Collaboration: As seen in Microsoft’s case, define clear governance objectives that are aligned with business goals before configuring tools. Data governance requires a cultural shift and strong leadership alignment. It’s a team effort involving IT, data, legal, compliance, and business units.
  2. Leverage the Unified Platform (Fabric + Purview): Treat Fabric and Purview as an integrated solution. Use Fabric to unify the data estate and Purview to apply consistent governance controls across it, enabling responsible democratization and breaking down silos.
  3. Prioritize Automation for Efficiency and Consistency: Automate governance tasks like sensitivity labeling, policy enforcement (DLP), and monitoring wherever possible. As the banking case study demonstrated, this reduces manual effort, ensures consistency, improves responsiveness, and boosts operational efficiency.
  4. Focus on User Empowerment and Education: Balance control with usability. Provide clear documentation, training, and tools (like the OneLake Data Hub and Purview catalog) to help users understand policies, find trustworthy data, and comply with requirements – turning governance into an accelerator, not a blocker.
  5. Implement Incrementally and Iterate: Data governance is an ongoing journey. Start with a pilot or focus on critical assets first. Monitor effectiveness, gather feedback, and continuously refine your approach based on evolving needs, regulations, and platform capabilities.

By taking a structured, collaborative, and tool-aware approach, informed by others’ successes, organizations can build a foundation of trust and control within Microsoft Fabric, transforming governance from a perceived burden into a strategic enabler that unlocks the true value of their data.

Should you have any questions or need assistance about Microsoft Fabric or Microsoft Purview, please don’t hesitate to contact me using the provided link: https://lawrence.eti.br/contact/

That’s it for today!


Integrating Azure OpenAI with Native Vector Support in Azure SQL Databases for Advanced Search Capabilities and Data Insights

Azure SQL Database has taken a significant step forward by introducing native support for vectors, unlocking advanced capabilities for applications that rely on semantic search, AI, and machine learning. By integrating vector search into Azure SQL, developers can now store, search, and analyze vector data directly alongside traditional SQL data, offering a unified solution for complex data analysis and enhanced search experiences.

Vectors in Azure SQL Database

Vectors are numerical representations of objects like text, images, or audio. They are essential for applications involving semantic search, recommendation systems, and more. These vectors are typically generated by machine learning models, capturing the semantic meaning of the data they represent.

The new vector functionality in Azure SQL Database allows you to store and manage these vectors within a familiar SQL environment. This eliminates the need for separate vector databases, streamlining your application architecture and simplifying your data management processes.

Key Benefits of Native Vector Support in Azure SQL

  • Unified Data Management: Store and query both traditional and vector data in a single database, reducing complexity and maintenance overhead.
  • Advanced Search Capabilities: Perform similarity searches alongside standard SQL queries, leveraging Azure SQL’s sophisticated query optimizer and powerful enterprise features.
  • Optimized Performance: Vectors are stored in a compact binary format, allowing for efficient distance calculations and optimized performance on vector-related operations.

Embeddings: The Foundation of Vector Search

At the heart of vector search are embeddings—dense vector representations of objects, generated by deep learning models. These embeddings capture the semantic similarities between related concepts, enabling tasks such as semantic search, natural language processing, and recommendation systems.

For example, word embeddings can cluster related words like “computer,” “software,” and “machine,” while distant clusters might represent words with entirely different meanings, such as “lion,” “cat,” and “dog.” These embeddings are particularly powerful in applications where context and meaning are more important than exact keyword matches.

Azure OpenAI makes it easy to generate embeddings by providing pre-trained machine learning models accessible through REST endpoints. Once generated, these embeddings can be stored directly in an Azure SQL Database, allowing you to perform vector search queries to find similar data points.

You can explore how vector embeddings work by visiting this amazing website: Transformer Explainer. It offers an excellent interactive experience to help you better understand how Generative AI operates in general.

Vector Search Use Cases

Vector search is a powerful technique used to find vectors in a dataset that are similar to a given query vector. This capability is essential in various applications, including:

  • Semantic Search: Rank search results based on their relevance to the user’s query.
  • Recommendation Systems: Suggest related items based on similarity in vector space.
  • Clustering: Group similar items together based on vector similarity.
  • Anomaly Detection: Identify outliers in data by finding vectors that differ significantly from the norm.
  • Classification: Classify items based on the similarity of their vectors to predefined categories.

For instance, consider a semantic search application where a user queries for “healthy breakfast options.” A vector search would compare the vector representation of the query with vectors representing product reviews, finding the most contextually relevant items—even if the exact keywords don’t match.

Key Features of Native Vector Support in Azure SQL

Azure SQL’s native vector support introduces several new functions to operate on vectors, which are stored in a binary format to optimize performance. Here are the key functions:

  • JSON_ARRAY_TO_VECTOR: Converts a JSON array into a vector, enabling you to store embeddings in a compact format.
  • ISVECTOR: Checks whether a binary value is a valid vector, ensuring data integrity.
  • VECTOR_TO_JSON_ARRAY: Converts a binary vector back into a human-readable JSON array, making it easier to work with the data.
  • VECTOR_DISTANCE: Calculates the distance between two vectors using a chosen distance metric, such as cosine or Euclidean distance.

These functions enable powerful operations for creating, storing, and querying vector data in Azure SQL Database.
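
To build intuition for what VECTOR_DISTANCE('cosine', ...) computes, here is the same calculation written out in plain Python. This is a toy illustration of the metric, not the database’s implementation:

Python
# Toy illustration of cosine distance, the metric behind
# VECTOR_DISTANCE('cosine', ...): 1 minus the cosine of the angle between
# two vectors. Smaller distance means more semantically similar.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [0.12, -0.05, 0.33]    # toy embedding of the search text
review = [0.10, -0.07, 0.30]   # toy embedding of a stored review
print(cosine_distance(query, review))  # close to 0 -> contextually similar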

Example: Vector Search in Action

Let’s walk through an example of using Azure SQL Database to store and query vector embeddings. Imagine you have a table of customer reviews, and you want to find reviews that are contextually related to a user’s search query.

  1. Storing Embeddings as Vectors:
    After generating embeddings using Azure OpenAI, you can store these vectors in a VARBINARY(8000) column in your SQL table:
SQL
   ALTER TABLE [dbo].[FineFoodReviews] ADD [VectorBinary] VARBINARY(8000);
   UPDATE [dbo].[FineFoodReviews]
   SET [VectorBinary] = JSON_ARRAY_TO_VECTOR([vector]);

This allows you to store the embeddings efficiently, ready for vector search operations.

  2. Performing Similarity Searches:
    To find reviews that are similar to a user’s query, you can convert the query into a vector and calculate the cosine distance between the query vector and the stored embeddings:
SQL
   DECLARE @e VARBINARY(8000);
   EXEC dbo.GET_EMBEDDINGS @model = '<yourmodeldeploymentname>', @text = 'healthy breakfast options', @embedding = @e OUTPUT;

   SELECT TOP(10) ProductId,
                  Summary,
                  Text,
                  VECTOR_DISTANCE('cosine', @e, VectorBinary) AS Distance
   FROM dbo.FineFoodReviews
   ORDER BY Distance;

This query returns the top reviews that are contextually related to the user’s search, even if the exact words don’t match.

  3. Hybrid Search with Filters:
    You can enhance vector search by combining it with traditional keyword filters to improve relevance and performance. For example, you could filter reviews based on criteria like user identity, review score, or the presence of specific keywords, and then apply vector search to rank the results by relevance:
SQL
   -- Comprehensive query with multiple filters.
   SELECT TOP(10)
       f.Id,
       f.ProductId,
       f.UserId,
       f.Score,
       f.Summary,
       f.Text,
       VECTOR_DISTANCE('cosine', @e, VectorBinary) AS Distance,
       CASE 
           WHEN LEN(f.Text) > 100 THEN 'Detailed Review'
           ELSE 'Short Review'
       END AS ReviewLength,
       CASE 
           WHEN f.Score >= 4 THEN 'High Score'
           WHEN f.Score BETWEEN 2 AND 3 THEN 'Medium Score'
           ELSE 'Low Score'
       END AS ScoreCategory
   FROM FineFoodReviews f
   WHERE
       f.UserId NOT LIKE 'Anonymous%'  -- Exclude anonymous users
       AND f.Score >= 2               -- Score threshold filter
       AND LEN(f.Text) > 50           -- Text length filter for detailed reviews
       AND (f.Text LIKE '%gluten%' OR f.Text LIKE '%dairy%') -- Keyword filter
   ORDER BY
       Distance,  -- Order by cosine distance
       f.Score DESC, -- Secondary order by review score
       ReviewLength DESC; -- Tertiary order by review length

This query combines semantic search with traditional filters, balancing relevance and computational efficiency.

Leveraging REST Services for Embedding Generation

Azure OpenAI provides REST endpoints for generating embeddings, which can be consumed directly from Azure SQL Database using the sp_invoke_external_rest_endpoint system stored procedure. This integration enables seamless interaction between your data and AI models, allowing you to build intelligent applications that combine the power of machine learning with the familiarity of SQL.

Here’s a stored procedure example that retrieves embeddings from a deployed Azure OpenAI model and stores them in the database:

SQL
CREATE PROCEDURE [dbo].[GET_EMBEDDINGS]
(
    @model VARCHAR(MAX),
    @text NVARCHAR(MAX),
    @embedding VARBINARY(8000) OUTPUT
)
AS
BEGIN
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @url VARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @text);

    SET @url = 'https://<resourcename>.openai.azure.com/openai/deployments/' + @model + '/embeddings?api-version=2023-03-15-preview';

    EXEC dbo.sp_invoke_external_rest_endpoint 
        @url = @url,
        @method = 'POST',   
        @payload = @payload,   
        @headers = '{"Content-Type":"application/json", "api-key":"<openAIkey>"}', 
        @response = @response OUTPUT;

    DECLARE @jsonArray NVARCHAR(MAX) = JSON_QUERY(@response, '$.result.data[0].embedding');
    SET @embedding = JSON_ARRAY_TO_VECTOR(@jsonArray);
END
GO

This stored procedure retrieves embeddings from the Azure OpenAI model and converts them into a binary format for storage in the database, making them available for similarity search and other operations.

Let’s implement an experiment with the Native Vector Support in Azure SQL

Despite not yet having a dedicated vector data type, Azure SQL Database provides a seamless way to store and manage vector data. Vectors, which are essentially lists of numbers, can be stored efficiently in a table, either as one row per vector with individual elements as columns or as serialized arrays, with column-store indexes available where they help. This approach ensures efficient storage and retrieval, making Azure SQL suitable for large-scale vector data management.

I used the Global News Dataset from Kaggle in my experiment.

First, you must create the columns to store the vector information. In my case, I created two columns: title_vector for the news title and content_vector for the news content. I wrote a small Python script for this, but you can also do it directly from SQL using a cursor. It’s worth knowing that by saving the vector information inside Azure SQL, you don’t need to pay for a separate vector database.

Python
from litellm import embedding
import pyodbc  # or another SQL connection library
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Set up Azure OpenAI credentials from environment variables for litellm
os.environ['AZURE_API_KEY'] = os.getenv('AZURE_API_KEY')
os.environ['AZURE_API_BASE'] = os.getenv('AZURE_API_BASE')
os.environ['AZURE_API_VERSION'] = os.getenv('AZURE_API_VERSION')

# Connect to your Azure SQL database
conn = pyodbc.connect(f'DRIVER={{ODBC Driver 17 for SQL Server}};'
                      f'SERVER={os.getenv("DB_SERVER")};'
                      f'DATABASE={os.getenv("DB_DATABASE")};'
                      f'UID={os.getenv("DB_UID")};'
                      f'PWD={os.getenv("DB_PWD")}')

def get_embeddings(text):
    # Truncate to 8191 characters as a rough guard for the
    # text-embedding-3-small model's input limit (counted in tokens)
    truncated_text = text[:8191]

    response = embedding(
        model="azure/text-embedding-3-small",
        input=truncated_text,
        api_key=os.getenv('AZURE_API_KEY'),
        api_base=os.getenv('AZURE_API_BASE'),
        api_version=os.getenv('AZURE_API_VERSION')
    )

    embeddings = response['data'][0]['embedding']
    return embeddings


def update_database(article_id, title_vector, content_vector):
    cursor = conn.cursor()

    # Convert vectors to strings
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # Update the SQL query to use the string representations
    cursor.execute("""
        UPDATE newsvector
        SET title_vector = ?, content_vector = ?
        WHERE article_id = ?
    """, (title_vector_str, content_vector_str, article_id))
    conn.commit()


def embed_and_update():
    cursor = conn.cursor()
    # Only embed articles that have not been processed yet
    cursor.execute("SELECT article_id, title, full_content FROM newsvector WHERE title_vector IS NULL AND full_content IS NOT NULL AND title IS NOT NULL ORDER BY published ASC")

    for row in cursor.fetchall():
        article_id, title, full_content = row

        print(f"Embedding article {article_id} - {title}")

        title_vector = get_embeddings(title)
        content_vector = get_embeddings(full_content)

        update_database(article_id, title_vector, content_vector)

embed_and_update()

These two columns will contain something like this: [-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265…]
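
One more preparation step: the search procedure shown later queries a VectorBinary column, so these string vectors need a one-time conversion into the compact binary format, mirroring the ALTER TABLE and JSON_ARRAY_TO_VECTOR pattern from the earlier example. A minimal sketch, reusing the pyodbc connection from the script above:

Python
# One-time conversion of the JSON-array string vectors into the binary vector
# format queried by the SearchNewsVector procedure shown later. Reuses `conn`
# from the embedding script above.
cursor = conn.cursor()
cursor.execute("ALTER TABLE newsvector ADD VectorBinary VARBINARY(8000);")
conn.commit()
cursor.execute("UPDATE newsvector SET VectorBinary = JSON_ARRAY_TO_VECTOR(content_vector);")
conn.commit()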

Second, you must create a procedure in the Azure SQL database to transform the query into a vector embedding.

SQL
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE [dbo].[GET_EMBEDDINGS]
(
    @model VARCHAR(MAX),
    @text NVARCHAR(MAX),
    @embedding VARBINARY(8000) OUTPUT
)
AS
BEGIN
    DECLARE @retval INT, @response NVARCHAR(MAX);
    DECLARE @url VARCHAR(MAX);
    DECLARE @payload NVARCHAR(MAX) = JSON_OBJECT('input': @text);

    -- Set the @url variable with proper concatenation before the EXEC statement
    SET @url = 'https://<Your App>.openai.azure.com/openai/deployments/' + @model + '/embeddings?api-version=2024-02-15-preview';

    EXEC dbo.sp_invoke_external_rest_endpoint 
        @url = @url,
        @method = 'POST',   
        @payload = @payload,   
        @headers = '{"Content-Type":"application/json", "api-key":"<Your Azure Open AI API Key"}', 
        @response = @response OUTPUT;

    -- Use JSON_QUERY to extract the embedding array directly
    DECLARE @jsonArray NVARCHAR(MAX) = JSON_QUERY(@response, '$.result.data[0].embedding');

    
    SET @embedding = JSON_ARRAY_TO_VECTOR(@jsonArray);
END

I also created another procedure to search the dataset directly using the Native Vector Support in Azure SQL.

SQL
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO

ALTER PROCEDURE [dbo].[SearchNewsVector] 
    @inputText NVARCHAR(MAX)
AS
BEGIN
    -- Query the SimilarNewsContentArticles table using the response
    IF OBJECT_ID('dbo.result', 'U') IS NOT NULL
        DROP TABLE dbo.result;

	--Assuming you have a stored procedure to get embeddings for a given text
	DECLARE @e VARBINARY(8000);
	EXEC dbo.GET_EMBEDDINGS @model = 'text-embedding-3-small', @text = @inputText, @embedding = @e OUTPUT;

	SELECT TOP(10) 
       [article_id]
      ,[source_id]
      ,[source_name]
      ,[author]
      ,[title]
      ,[description]
      ,[url]
      ,[url_to_image]
      ,[content]
      ,[category]
      ,[full_content]
      ,[title_vector]
      ,[content_vector]
      ,[published]
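      -- VectorBinary: VARBINARY(8000) column populated via JSON_ARRAY_TO_VECTOR(content_vector) in the one-time conversion step shown earlier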
      ,VECTOR_DISTANCE('cosine', @e, VectorBinary) AS cosine_distance
	into result
	FROM newsvector
	ORDER BY cosine_distance;
END

Finally, you can start querying your table using prompts instead of keywords. This is awesome!

Check out the app I developed with the Native Vector Support in Azure SQL, which is designed to assist you in crafting prompts and evaluating your performance using my newsvector dataset. To explore the app, click here.

Like always, I also created this GitHub repository with everything I did.

Signing Up for the Azure SQL Database Native Vector Support Private Preview

You can sign up for the private preview at this link.

This article, published by Davide Mauri and Pooja Kamath at the Microsoft Build 2024 event, provides all the details.

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

Conclusion

The integration of Azure OpenAI with native vector support in Azure SQL Database unlocks new possibilities for applications that require advanced search capabilities and data analysis. By storing and querying vector embeddings alongside traditional SQL data, you can build powerful solutions that combine the best of both worlds—semantic understanding with the reliability and performance of Azure SQL.

This innovation simplifies application development, enhances data insights, and paves the way for the next generation of intelligent applications.

That’s it for today!

Sources

Azure SQL DB Vector Functions Private Preview | Data Exposed (youtube.com)

Announcing EAP for Vector Support in Azure SQL Database – Azure SQL Devs’ Corner (microsoft.com)

The New Black Gold: How Data Became the Most Valuable Asset in Tech

In the annals of history, the term “black gold” traditionally referred to oil, a commodity that powered the growth of modern economies, ignited wars, and led to the exploration of uncharted territories. Fast forward to the 21st century, and a new form of black gold has emerged, one that is intangible yet infinitely more powerful: data. This precious commodity has become the cornerstone of technological innovation, driving the evolution of artificial intelligence (AI), shaping economies, and transforming industries. Let’s dive into how data ascended to its status as the most valuable asset in technology.

The Economic Power of Data

Data has transcended its role as a mere resource for business insights and operations, becoming a pivotal economic asset. Companies that possess vast amounts of data or have the capability to efficiently process and analyze data hold significant economic power and influence. This influence is not just limited to the tech industry but extends across all sectors, including healthcare, finance, and manufacturing, to name a few. Leveraging data effectively can lead to groundbreaking innovations, disrupt industries, and create new markets.

Image sourced from this website: Value in the digital economy: data monetised (nationthailand.com)

The economic potential of data is immense. The ability to harness insights from data translates into a competitive advantage for businesses. Predictive analytics, driven by data, enable companies to forecast customer behavior, optimize pricing strategies, and streamline supply chains. Data analysis is critical to personalized medicine, diagnostics, and drug discovery in healthcare. In the financial sector, data-driven algorithms power trading strategies and risk management assessments. Data’s reach extends beyond traditional industries, transforming fields like agriculture through precision farming and intelligent sensors.

The rise of data-driven decision-making has given birth to a thriving data economy. Companies specialize in aggregating, cleansing, and enriching datasets, turning them into marketable assets. The development of machine learning and artificial intelligence tools, combined with big data, enables more sophisticated and transformative data usage. Industries across the spectrum recognize the power of data, fueling investment in technologies and talent, with data scientists and analysts finding themselves in high demand.

The Rise of Data as a Commodity

The rise of data as a commodity represents a significant shift in the global economy, where the value of intangible assets, specifically digital data, has surpassed that of traditional physical commodities. This transition reflects the increasing importance of data in driving innovation, enhancing productivity, and fostering economic growth.

According to International Banker, the value of data has escalated because of the vast volumes available to financial services and other organizations, coupled with the nearly limitless processing power of cloud computing. This has enabled the manipulation, integration, and analysis of diverse data sources, transforming data into a critical asset for the banking sector and beyond. Robotics and Automation News further illustrates this by noting the exponential rise in Internet-connected devices, which has led to the generation of staggering amounts of data daily. As of 2018, more than 22 billion Internet-of-Things (IoT) devices were active, highlighting the vast scale of data generation and its potential value.

MIT Technology Review emphasizes data as a form of capital, akin to financial and human capital, which is essential for creating new digital products and services. This perspective is supported by studies indicating that businesses prioritizing “data-driven decision-making” achieve significantly higher output and productivity. Consequently, companies rich in data assets, such as Airbnb, Facebook, and Netflix, have redefined competition within their industries, underscoring the need for traditional companies to adopt a data-centric mindset.

Data transformation into a valuable commodity is not just a technological or economic issue but also entails significant implications for privacy, security, and governance. As organizations harness the power of data to drive business and innovation, the ethical considerations surrounding data collection, processing, and use become increasingly paramount.

In summary, the rise of data as a commodity marks a pivotal development in the digital economy, highlighting the critical role of data in shaping future economic landscapes, driving innovation, and redefining traditional industry paradigms.

The Challenges and Ethics of Data Acquisition

The discourse on the challenges and ethics of data acquisition and the application of artificial intelligence (AI) spans various considerations, reflecting the intricate web of moral, societal, and legal issues that modern technology presents. As AI becomes increasingly integrated into various facets of daily life, its potential to transform industries, enhance efficiency, and contribute to societal welfare is matched by significant ethical and societal challenges. These challenges revolve around privacy, discrimination, accountability, transparency, and the overarching role of human judgment in the age of autonomous decision-making systems (OpenMind, Harvard Gazette).

The ethical use of data and AI involves a nuanced approach that encompasses not just the legal compliance aspect but also the moral obligations organizations and developers have towards individuals and society at large. This includes ensuring privacy through anonymization and differential privacy, promoting inclusivity by actively seeking out diverse data sources to mitigate systemic biases, and maintaining transparency about how data is collected, used, and shared. Ethical data collection practices emphasize the importance of the data life cycle, ensuring accountability and accuracy from the point of collection to eventual disposal (Omdena, ADP).

Moreover, the ethical landscape of AI and data use extends to addressing concerns about unemployment and the societal implications of automation. As AI continues to automate tasks traditionally performed by humans, questions about the future of work, socio-economic inequality, and environmental impacts come to the forefront. Ethical considerations also include automating decision-making processes, which can either benefit or harm society based on the ethical standards encoded within AI systems. The potential for AI to exacerbate existing disparities and the risk of moral deskilling among humans as decision-making is increasingly outsourced to machines underscores the need for a comprehensive ethical framework governing AI development and deployment (Markkula Center for Applied Ethics).

In this context, the principles of transparency, fairness, and responsible stewardship of data and AI technologies form the foundation of ethical practice. Organizations are encouraged to be transparent about their data practices, ensure fairness in AI outcomes to avoid amplifying biases, and engage in ethical deliberation to navigate the complex interplay of competing interests and values. Adhering to these principles aims to harness the benefits of AI and data analytics while safeguarding individual rights and promoting societal well-being (ADP).

How is the “new black gold” being utilized?

1. AI-Driven Facial Emotion Detection
  • Overview: This application uses deep learning algorithms to analyze facial expressions and detect emotions. This technology provides insights into human emotions and behavior and is used in various fields, including security, marketing, and healthcare.
  • Data Utilization: By training on vast datasets of facial images tagged with emotional states, the AI can learn to identify subtle expressions, showcasing the critical role of diverse and extensive data in enhancing algorithm accuracy.
2. Food Freshness Monitoring Systems
  • Overview: A practical application that employs AI to monitor the freshness of food items in your fridge. It utilizes image recognition and machine learning to detect signs of spoilage or expiration.
  • Data Requirement: This system relies on a comprehensive dataset of food items in various states of freshness, learning from visual cues to accurately predict when food might have gone bad, reducing waste and ensuring health safety.
3. Conversational AI Revolutionized
  • Overview: Large Language Models (LLMs) such as OpenAI’s ChatGPT, Google’s Gemini, and Anthropic’s Claude are state-of-the-art language models that simulate human-like conversations, providing responses that can be indistinguishable from a human’s. They are used in customer service, marketing, education, and entertainment.
  • Data Foundation: The development of LLMs required extensive training on diverse language data from books, websites, and other textual sources, highlighting the need for large, varied datasets to achieve nuanced understanding and generation of human language.
4. Synthetic Data Generation for AI Training
  • Overview: To address privacy concerns and the scarcity of certain types of training data, some AI projects are turning to synthetic data generation. This involves creating artificial datasets that mimic real-world data, enabling the continued development of AI without compromising privacy.
  • Application of Data: These projects illustrate the innovative use of algorithms to generate new data points, demonstrating how unique data needs push the boundaries of what’s possible in AI research and development.

What are Crawling Services and Platforms?

Crawling services and platforms are specialized software tools and infrastructure designed to navigate and index the content of websites across the internet systematically. These services work by visiting web pages, reading their content, and following links to other pages within the same or different websites, effectively mapping the web structure. The data collected through this process can include text, images, and other multimedia content, which is then used for various purposes, such as web indexing for search engines, data collection for market research, content aggregation for news or social media monitoring, and more. Crawling platforms often provide APIs or user interfaces to enable customized crawls based on specific criteria, such as keyword searches, domain specifications, or content types. This technology is fundamental for search engines to provide up-to-date results and for businesses and researchers to gather and analyze web data at scale.

Here are some practical examples to enhance your understanding of the concept:

1. Common Crawl
  • Overview: Common Crawl is a nonprofit organization that offers a massive archive of web-crawled data. It crawls the web at scale, providing access to petabytes of data, including web pages, links, and metadata, all freely available to the public.
  • Utility for Data Acquisition: Common Crawl is instrumental for researchers, companies, and developers looking to analyze web data at scale without deploying their own crawlers, thus democratizing access to large-scale web data.
2. Bright Data (Formerly Luminati)
  • Overview: Bright Data is recognized as one of the leading web data platforms, offering comprehensive web scraping and data collection solutions. It provides tools for both code-driven and no-code data collection, catering to various needs from simple data extraction to complex data intelligence.
  • Features and Applications: With its robust infrastructure, including a vast proxy network and advanced data collection tools, Bright Data enables users to scrape data across the internet ethically. It supports various use cases, from market research to competitive analysis, ensuring compliance and high-quality data output.
3. Developer Tools: Playwright, Puppeteer and Selenium
  • Overview: For those seeking a more hands-on approach to web scraping, developer tools like Playwright, Puppeteer, and Selenium offer frameworks for automating browser environments. These tools are essential for developers building custom crawlers that programmatically navigate and extract data from web pages.
  • Use in Data Collection: By leveraging these tools, developers can create sophisticated scripts that mimic human navigation patterns, bypass captcha challenges, and extract specific data points from complex web pages, enabling precise and targeted data collection strategies (a minimal Playwright sketch appears at the end of this section).
4. No-Code Data Collection Platforms
  • Overview: Recognizing the demand for simpler, more accessible data collection methods, several platforms now offer no-code solutions that allow users to scrape and collect web data without writing a single line of code.
  • Impact on Data Acquisition: These platforms lower the barrier to entry for data collection, making it possible for non-technical users to gather data for analysis, market research, or content aggregation, further expanding the pool of individuals and organizations that can leverage web data.
Examples of No-Code Data Collection Platforms

1. ParseHub

  • Description: ParseHub is a powerful and intuitive web scraping tool that allows users to collect data from websites using a point-and-click interface. It can handle websites with JavaScript, redirects, and AJAX.
  • Website: https://www.parsehub.com/

2. WebHarvy

  • Description: WebHarvy is a visual web scraping software that can automatically scrape images, texts, URLs, and emails from websites using a built-in browser. It’s designed for users who prefer a visual approach to data extraction.
  • Website: https://www.webharvy.com/

3. Import.io

  • Description: Import.io offers a more comprehensive suite of data integration tools and web scraping capabilities. It allows no-code data extraction from web pages and can transform and integrate this data with various applications.
  • Website: https://www.import.io/

4. DataMiner

  • Description: DataMiner is a Chrome and Edge browser extension that allows you to scrape data from web pages and into various file formats like Excel, CSV, or Google Sheets. It offers pre-made data scraping templates and a point-and-click interface to select the data you want to extract.
  • Website: Find it on the Chrome Web Store or Microsoft Edge Add-ons

These platforms vary in capabilities, from simple scraping tasks to more complex data extraction and integration functionalities, catering to a wide range of user needs without requiring coding skills.

5. Other great web scraping tool options include

1. Apify

  • Description: Apify is a cloud-based web scraping and automation platform that utilizes Puppeteer, Playwright, and other technologies to extract data from websites, automate workflows, and integrate with various APIs. It offers a ready-to-use library of actors (scrapers) for everyday tasks and allows users to develop custom solutions.
  • Website: https://apify.com/

2. ScrapingBee

  • Description: ScrapingBee is a web scraping API that handles headless browsers and rotating proxies, allowing users to scrape challenging websites easily. It supports both Puppeteer and Playwright, enabling developers to execute JavaScript-heavy scraping tasks without getting blocked.
  • Website: https://www.scrapingbee.com/

3. Browserless

  • Description: Browserless is a cloud service that provides a scalable and reliable way to run Puppeteer and Playwright scripts in the cloud. It’s designed for developers and businesses needing to automate browsers at scale for web scraping, testing, and automation tasks without managing their browser infrastructure.
  • Website: https://www.browserless.io/

4. Octoparse

  • Description: While Octoparse itself is primarily a no-code web scraping tool, it provides advanced options that allow integration with custom scripts, potentially incorporating Puppeteer or Playwright for specific data extraction tasks, especially when dealing with websites that require interaction or execute complex JavaScript.
  • Website: https://www.octoparse.com/

5. ZenRows

  • Description: ZenRows is a web scraping API that simplifies the process of extracting web data and handling proxies, browsers, and CAPTCHAs. It supports Puppeteer and Playwright, making it easier for developers to scrape data from modern web applications that rely heavily on JavaScript.
  • Website: https://www.zenrows.com/

Looking to the Future

As AI technologies like ChatGPT and DALL-E 3 continue to evolve, powered by vast amounts of data, researchers have raised concerns about a potential shortage of high-quality training data by 2026. This scarcity could impede the growth and effectiveness of AI systems, given the need for large, high-quality datasets to develop accurate and sophisticated algorithms. High-quality data is crucial for avoiding biases and inaccuracies in AI outputs, as seen in cases where AI has replicated undesirable behaviors from low-quality training sources. To address this impending data shortage, the industry could turn to improved AI algorithms to better use existing data, generate synthetic data, and explore new sources of high-quality content, including negotiating with content owners for access to previously untapped resources. These strategies aim to sustain the development of AI technologies and mitigate ethical concerns by potentially offering compensation for the use of creators’ content.

Looking to the future, the importance of data, likened to the new black gold, is poised to grow exponentially, heralding a future prosperous with innovation and opportunity. Anticipated advancements in data processing technologies, such as quantum and edge computing, promise to enhance the efficiency and accessibility of data analytics, transforming the landscape of information analysis. The emergence of synthetic data stands out as a groundbreaking solution to navigate privacy concerns, enabling the development of AI and machine learning without compromising individual privacy. These innovations indicate a horizon brimming with potential for transformative changes in collecting, analyzing, and utilizing data.

However, the true challenge and opportunity lie in democratizing access to this vast wealth of information, ensuring that the benefits of data are not confined to a select few but are shared across the global community. Developing equitable data-sharing models and open data initiatives will be crucial in leveling the playing field, offering startups, researchers, and underrepresented communities the chance to participate in and contribute to the data-driven revolution. As we navigate this promising yet complex future, prioritizing ethical considerations, transparency, and the responsible use of data will be paramount in fostering an environment where innovation and opportunity can flourish for all, effectively addressing the challenges of data scarcity and shaping a future enriched by data-driven advancements.

Conclusion

The elevation of data to the status of the most valuable asset in technology marks a pivotal transformation in our global economy and society. This shift reflects a more profound change in our collective priorities, recognizing data’s immense potential for catalyzing innovation, driving economic expansion, and solving complex challenges. However, with great power comes great responsibility. As we harness this new black gold, the ethical considerations and societal impacts of our data-driven endeavors become increasingly significant. Ensuring that the benefits of data are equitably distributed and that privacy, security, and ethical use are prioritized is essential for fostering trust and sustainability in technological advancement.

We encounter unparalleled opportunities and profound challenges in navigating the future technology landscape powered by the vast data reserves. The potential for data to improve lives, streamline industries, and open new frontiers of knowledge is immense. Yet, this potential must be balanced with vigilance against the risks of misuse, bias, and inequality arising from unchecked data proliferation. Crafting policies, frameworks, and technologies that safeguard individual rights while promoting innovation will be crucial in realizing the full promise of data. Collaborative efforts among governments, businesses, and civil society to establish norms and standards for data use can help ensure that technological progress serves the broader interests of humanity.

As we look to the future, the journey of data as the cornerstone of technological advancement is only beginning. Exploring this new black gold will continue to reshape our world, offering pathways to previously unimaginable possibilities. Yet, the true measure of our success in this endeavor will not be in the quantity of data collected or the sophisticated algorithms developed but in how well we leverage this resource to enhance human well-being, foster sustainable development, and bridge the divides that separate us. In this endeavor, our collective creativity, ethical commitment, and collaborative spirit will be our most valuable assets, guiding us toward a future where technology, powered by data, benefits all of humanity.

That’s it for today!

Sources

https://www.frontiersin.org/articles/10.3389/fsurg.2022.862322/full

Researchers warn we could run out of data to train AI by 2026. What then? (theconversation.com)

The Business Case for AI Data Analytics in 2024 – YouTube

OpenAI Asks Public for More Data to Train Its AI Models (aibusiness.com)

Navigating Vector Operations in Azure SQL for Better Data Insights: A Guide to Using Generative AI Prompts to Query Datasets

The evolving landscape of data analytics has brought vector databases to the forefront, especially with their application in finding similarities in diverse data types such as articles, photos, and products. Azure SQL, combined with the prowess of OpenAI, offers a powerful platform for executing vector operations, simplifying the task of finding similar items and enhancing recommendation systems in applications.

What Is Vector Similarity, and How Do You Calculate Cosine Similarity?

Vector similarity revolves around transforming data into numerical vectors or embeddings. These embeddings are numerical representations of various concepts converted into sequences of numbers, making it easier for computers to grasp their relationships. This method is particularly effective in comparing and finding similarities between data points, a crucial feature in applications like search engines and clustering algorithms.

Cosine similarity, a commonly used metric in vector similarity, measures the cosine of the angle between two vectors. This metric is crucial in determining the degree of similarity between the vectors, irrespective of their size. In Azure SQL, cosine similarity can be computed with a simple SQL formula involving SUM and SQRT functions applied to the vector elements, thus providing a straightforward yet powerful way to measure vector similarity.
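
Formally, for two vectors a and b with elements a_i and b_i:

$$\text{cosine\_similarity}(a, b) = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}}$$

The SQL formula that follows mirrors this definition term by term: the numerator is the dot product of the two vectors, and the denominator is the product of their magnitudes.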

Cosine similarity can be calculated in SQL using SUM and SQRT. The sketch below assumes both vectors are stored in a vectors_values table, one element per row, with vector_id identifying the vector, vector_value_id the element’s position, and vector_value the element itself:

SELECT 
    SUM(a.vector_value * b.vector_value) / (  
        SQRT(SUM(a.vector_value * a.vector_value)) * SQRT(SUM(b.vector_value * b.vector_value))   
    ) AS cosine_similarity
FROM
    vectors_values AS a
INNER JOIN
    vectors_values AS b
    ON b.vector_value_id = a.vector_value_id -- align elements by position
WHERE
    a.vector_id = 1  -- first vector
    AND b.vector_id = 2; -- second vector

How to Query Azure OpenAI

Azure SQL’s integration with Azure OpenAI simplifies generating and working with embeddings. Users can obtain the vector representation of any given text by creating an Azure OpenAI resource and deploying a model like text-embedding-ada-002. This integration enables the execution of REST API calls from within the Azure SQL Database, making fetching and working with embeddings more streamlined and accessible.

SQL
-- Declare a variable to hold the response from the external REST API
declare @response nvarchar(max);

-- Declare and initialize a variable with JSON payload. 
-- The JSON object contains an 'input' key with a text value.
declare @payload nvarchar(max) = json_object('input': 'This is how to futureproof your career in an AI world');

-- Execute a stored procedure to invoke an external REST endpoint
exec sp_invoke_external_rest_endpoint
    @url = 'https://<your-app-name>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview', -- The URL of the REST endpoint
    @credential = [https://<your-app-name>.openai.azure.com], -- Credential for accessing the REST API
    @payload = @payload, -- The JSON payload defined earlier
    @response = @response output; -- Output parameter to store the response from the API
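
The embedding itself comes back inside the JSON response. Here is a minimal sketch of extracting it, assuming that sp_invoke_external_rest_endpoint wraps the HTTP body under the result key and that the body follows the standard OpenAI embeddings response shape:

SQL
-- Pull the embedding array out of the wrapped JSON response.
-- Path assumption: the HTTP body is nested under 'result', and the
-- embedding sits at data[0].embedding.
declare @embedding nvarchar(max) = json_query(@response, '$.result.data[0].embedding');
select @embedding; -- e.g. [-0.0223, -0.0375, ...]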

Advantages of Implementing Vectors in Azure SQL Database

1. Ease of Access with Azure OpenAI: Azure SQL combined with Azure OpenAI offers easy access to REST services for generating embeddings using pre-trained machine learning models. This accessibility facilitates the calculation of embeddings, which is otherwise a complex task.

2. Efficient Storage with Columnstore Indexes: Azure SQL databases efficiently store vectors using columnstore indexes. This method is particularly beneficial since Azure SQL doesn’t have a specific vector data type. Vectors, essentially lists of numbers, can be conveniently stored in a table with one row per vector element.

3. Fast Distance Calculation: The internal optimization of columnstore indexes in Azure SQL, employing SIMD and AVX-512 instructions, allows for high-speed calculation of distances between vectors, which is crucial for determining similarity.

4. Integration with Azure AI Search: Azure SQL’s integration with Azure AI Search streamlines the entire process of chunking, generating, storing, and querying vectors for vector search, significantly speeding up the development of the vectorization pipeline and minimizing maintenance tasks.

5. Capability for Complex Operations: Azure SQL enables complex operations like indexing, storing, and retrieving vector embeddings from a search index, which is essential for identifying the most similar documents in a vector space.

6. Versatile Data Handling: Azure SQL’s ability to handle structured and unstructured data, along with vector data, provides more versatility compared to vector databases, which are primarily optimized for vector data storage and retrieval.

7. Ease of Querying and Relevance Determination: The integration with Azure OpenAI allows easy querying of the REST service to obtain vector representations of text, which can then be used to calculate similarity against stored vectors, identifying the most relevant data.

8. Simplified Deployment and Management: Deploying and managing an embedding model via the Azure portal is straightforward, reducing the complexity of managing vector databases.

9. Suitability for a Range of Applications: While vector databases are specialized for high-dimensional similarity searches, Azure SQL’s broader application scope makes it suitable for various types of data and applications, from financial records to customer data.

10. Support for Advanced Azure Features: Azure SQL supports advanced Azure features, such as AI and machine learning capabilities, which can be seamlessly integrated with vector similarity operations for enhanced analytics and insights.

These advantages highlight the flexibility, efficiency, and ease of use of Azure SQL databases for vector similarity operations, making them a preferable choice in scenarios where diverse data types and complex operations are involved, alongside the need for seamless integration with other Azure services.

Implementing Vectors in Azure SQL Database

Azure SQL Database provides a seamless way to store and manage vector data despite not having a specific vector data type. Using columnstore indexes, vectors, which are essentially lists of numbers, can be stored efficiently in a table. Each vector can either be serialized as a JSON array in a single column or exploded into one row per element. This approach ensures efficient storage and retrieval, making Azure SQL suitable for large-scale vector data management. An illustrative per-element layout is sketched below.
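
This sketch is only illustrative; the table and column names mirror the ones created later in this post, where the table is actually built with SELECT ... INTO:

SQL
-- One row per vector element, compressed and accelerated by a
-- clustered columnstore index ordered on the owning article.
CREATE TABLE dbo.news_titles_vector
(
    article_id      INT   NOT NULL, -- which article the vector belongs to
    vector_value_id INT   NOT NULL, -- the element's position within the vector
    vector_value    FLOAT NOT NULL  -- the element's value
);

CREATE CLUSTERED COLUMNSTORE INDEX cci_news_titles_vector
ON dbo.news_titles_vector
ORDER (article_id);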

I used the Global News Dataset from Kaggle in my experiment.

First, you must create the columns to store the vector information. In my case, I created two columns, title_vector for the news title and content_vector for the news content, as sketched below. I wrote a small Python script to populate them, but you can also do it directly from SQL using a cursor. It’s worth noting that by saving the vector information inside Azure SQL, you don’t need to pay for a separate vector database.
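
A minimal sketch of adding those columns, assuming a news table loaded from the Kaggle dataset:

SQL
-- The embeddings are stored as JSON-serialized arrays of floats.
ALTER TABLE dbo.news
    ADD title_vector   NVARCHAR(MAX) NULL,
        content_vector NVARCHAR(MAX) NULL;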

Python
import openai  # uses the legacy openai<1.0 SDK interface
import pyodbc  # or another SQL connection library

# Set up OpenAI credentials
openai.api_type = "azure"
openai.api_key = "<YOUR AZURE OPEN AI KEY>"
openai.api_base = "https://<your-app-name>.openai.azure.com/"
openai.api_version = "2023-07-01-preview"

# Connect to your Azure SQL database
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=<SERVER>;DATABASE=<DATABASE>;'
                      'UID=<USER>;PWD=<PASSWORD>')

def get_embeddings(text):
    # Truncate the text to 8000 characters
    truncated_text = text[:8000]

    response = openai.Embedding.create(input=truncated_text, engine="embeddings")
    embeddings = response['data'][0]['embedding']
    return embeddings

def update_database(article_id, title_vector, content_vector):
    cursor = conn.cursor()

    # Convert vectors to strings
    title_vector_str = str(title_vector)
    content_vector_str = str(content_vector)

    # Update the SQL query to use the string representations
    cursor.execute("""
        UPDATE news
        SET title_vector = ?, content_vector = ?
        WHERE article_id = ?
    """, (title_vector_str, content_vector_str, article_id))
    conn.commit()

def embed_and_update():
    cursor = conn.cursor()
    # Only embed articles that are missing vectors and have both a title and content.
    cursor.execute("SELECT article_id, title, full_content FROM news "
                   "WHERE title_vector IS NULL AND full_content IS NOT NULL "
                   "AND title IS NOT NULL ORDER BY published DESC")

    for row in cursor.fetchall():
        article_id, title, full_content = row
        
        print(f"Embedding article {article_id} - {title}")
        
        title_vector = get_embeddings(title)
        content_vector = get_embeddings(full_content)
        
        update_database(article_id, title_vector, content_vector)

embed_and_update()

These two columns will contain something like this: [-0.02232750505208969, -0.03755787014961243, -0.0066827102564275265…]

Second, you must explode these vectors into a new table for each vector field, with one row per vector element. In this case, I created a stored procedure to do it.

SQL
-- =============================================
-- Author:      Lawrence Teixeira
-- Create Date: 11-24-2023
-- Description: This stored procedure creates vectors for news titles and contents.
-- It processes data from the 'news' table and stores the vectors in separate tables.
-- =============================================

CREATE PROCEDURE [dbo].[Create_News_Vector]
AS
BEGIN
    SET NOCOUNT ON; -- Prevents the sending of DONE_IN_PROC messages to the client.

    -- Extract and store title vectors:
    -- First, check if the 'news_titles_vector' table exists and drop it if it does.
    IF OBJECT_ID('dbo.news_titles_vector', 'U') IS NOT NULL
        DROP TABLE dbo.news_titles_vector;

    -- Using a Common Table Expression (CTE) to process title vectors.
    WITH cte AS
    (
        SELECT 
            v.article_id,    
            CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
            CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
        FROM 
            dbo.news AS v 
        CROSS APPLY 
            OPENJSON(v.title_vector) tv -- Parses JSON of title_vector in the 'news' table.
    )
    -- Create 'news_titles_vector' table with processed vectors.
    SELECT
        article_id,
        vector_value_id,
        vector_value
    INTO
        dbo.news_titles_vector
    FROM
        cte;

    -- Extract and store content vectors:
    -- Check and drop 'news_contents_vector' table if it exists.
    IF OBJECT_ID('dbo.news_contents_vector', 'U') IS NOT NULL
        DROP TABLE dbo.news_contents_vector;

    -- CTE for processing content vectors.
    WITH cte AS
    (
        SELECT 
            v.article_id,    
            CAST(tv.[key] AS INT) AS vector_value_id, -- Casting 'key' as INT for vector ID.
            CAST(tv.[value] AS FLOAT) AS vector_value   -- Casting 'value' as FLOAT for vector value.
        FROM 
            dbo.news AS v 
        CROSS APPLY 
            OPENJSON(v.content_vector) tv -- Parses JSON of content_vector in the 'news' table.
    )
    -- Create 'news_contents_vector' table with processed vectors.
    SELECT
        article_id,
        vector_value_id,
        vector_value
    INTO
        dbo.news_contents_vector
    FROM
        cte;

    -- Create clustered columnstore indexes on the vector tables.
    -- They compress the data and speed up the aggregations used in
    -- similarity calculations.
    CREATE CLUSTERED COLUMNSTORE INDEX cci_news_titles_vector
    ON dbo.news_titles_vector
    ORDER (article_id);

    CREATE CLUSTERED COLUMNSTORE INDEX cci_news_contents_vector
    ON dbo.news_contents_vector
    ORDER (article_id);
END
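
With the procedure in place, rebuilding both vector tables whenever new articles are embedded is a single call:

SQL
EXEC dbo.Create_News_Vector;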

I also created a function to query the dataset directly using the cosine similarity formula.

SQL
-- This Azure SQL function finds news articles similar to the given content vector.
CREATE function [dbo].[SimilarNewsContentArticles](
    @vector nvarchar(max) -- Input parameter: JSON string representing a content vector.
)
returns table -- The function returns a table.
as
return with 

-- CTE for processing the input vector.
cteVector as
(
    select 
        cast([key] as int) as [vector_value_id], -- Extracts and casts the 'key' from JSON to int.
        cast([value] as float) as [vector_value] -- Extracts and casts the 'value' from JSON to float.
    from 
        openjson(@vector) -- Parses the input JSON vector.
),

-- CTE for calculating similarity scores with existing articles.
cteSimilar as
(
    select top (50)
        v2.article_id, 
        sum(v1.[vector_value] * v2.[vector_value]) as cosine_distance 
        -- Dot product of the two vectors. OpenAI embeddings are normalized
        -- to unit length, so the dot product equals cosine similarity.
    from 
        cteVector v1 -- Uses the processed input vector.
    inner join 
        dbo.news_contents_vector v2 
        on v1.vector_value_id = v2.vector_value_id -- Joins with stored article vectors.
    group by
        v2.article_id
    order by
        cosine_distance desc -- Orders by similarity score, descending.
)

-- Final selection combining article data with similarity scores.
select 
    a.*, -- Selects all columns from the news article.
    r.cosine_distance -- Includes the calculated similarity score.
from 
    cteSimilar r -- Uses the similarity scores CTE.
inner join 
    dbo.news a on r.article_id = a.article_id -- Joins with the news articles table.
GO

Finally, you can start querying your table using prompts instead of keywords, as the end-to-end sketch below shows. This is awesome!
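
A minimal end-to-end sketch, under the same assumptions as the earlier REST call (a deployment named embeddings, a credential registered for the endpoint, and the embedding at $.result.data[0].embedding); the prompt text is just an example:

SQL
-- 1. Embed the natural-language prompt.
declare @response nvarchar(max), @vector nvarchar(max);
declare @payload nvarchar(max) = json_object('input': 'news about electric cars and the energy transition');

exec sp_invoke_external_rest_endpoint
    @url = 'https://<your-app-name>.openai.azure.com/openai/deployments/embeddings/embeddings?api-version=2023-03-15-preview',
    @credential = [https://<your-app-name>.openai.azure.com],
    @payload = @payload,
    @response = @response output;

-- 2. Extract the embedding and ask for the most similar articles.
set @vector = json_query(@response, '$.result.data[0].embedding');

select top (10)
    article_id, title, cosine_distance
from
    dbo.SimilarNewsContentArticles(@vector)
order by
    cosine_distance desc;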

Check out the app I developed, which is designed to assist you in crafting prompts and evaluating your performance using my news dataset. To explore the app, click here.

I also created this GitHub repository with everything I did.

Conclusion

While the sample in this blog is not optimized for maximum efficiency, it is an excellent starting point for understanding and implementing vector operations in Azure SQL. The process, despite its simplicity, is relatively fast. For instance, querying an eight vCore Azure SQL database can return the fifty most similar articles in just half a second, demonstrating the efficiency and utility of vector operations in Azure SQL Database for data analytics and insights. If you want to know more about this topic, don’t hesitate to get in touch with me.

That’s it for today!

Sources:

Vector Similarity Search with Azure SQL database and OpenAI – Azure SQL Devs’ Corner (microsoft.com)

Vector Similarity Search with Azure SQL database and OpenAI | by Davide Mauri | Microsoft Azure | Medium

Azure-Samples/azure-sql-db-openai: Samples on how to use Azure SQL database with Azure OpenAI (github.com)