If you’ve ever worked with Power BI in an enterprise environment, you’ve faced the same frustrating challenge that has plagued data professionals for years: comprehensive documentation. You spend weeks building sophisticated reports with complex DAX measures, intricate data models, and carefully crafted visualizations, only to realize that documenting everything properly will take nearly as long as building the solution itself.
The documentation dilemma is a real and costly issue. Teams often skip it due to time constraints, resulting in knowledge silos when developers leave the organization. Stakeholders struggle to understand the report’s logic without proper documentation. Compliance requirements go unmet. New team members take months to understand existing models. Manual documentation becomes outdated the moment a model changes.
What if there were a way to generate comprehensive, professional Power BI documentation automatically, in minutes rather than hours? What if you could chat with an AI assistant about your report's structure, ask questions about specific DAX measures, and get detailed explanations about table relationships, all based on your actual model data?
Enter AutoDoc, the AI-powered solution that finally solves Power BI's documentation problem once and for all.
What is AutoDoc?
AutoDoc is a revolutionary documentation generator specifically designed for Power BI. It harnesses the power of artificial intelligence to create comprehensive, professional documentation automatically. Think of it as having a dedicated documentation specialist who never sleeps, never misses details, and can analyze your entire Power BI model in minutes.
AutoDoc is an open-source tool that offers complete flexibility for implementation, both in the cloud and locally, through the repository available on GitHub. The solution allows secure execution in a local environment, including with local LLM models via Ollama, or can be securely hosted on platforms such as Microsoft Azure AI Foundry or Amazon Bedrock.
The Multi-AI Advantage
What sets AutoDoc apart from other documentation tools is its integration with multiple leading AI providers, giving you the flexibility to choose the language model that best fits your needs and budget:
OpenAI GPT-4.1 models (nano and mini variants)
Azure OpenAI GPT-4.1 nano for enterprise environments
Anthropic Claude 3.7 Sonnet for advanced reasoning
Google Gemini 2.5 Pro for comprehensive analysis
Llama 4 for open-source flexibility
Core Capabilities
Intelligent File Processing: AutoDoc supports both .pbit (Power BI Template) and .zip files, automatically extracting and analyzing all components of your Power BI model regardless of complexity.
Comprehensive Analysis: The tool meticulously documents every aspect of your Power BI solution, including tables, columns, measures, calculated fields, data sources, relationships, and Power Query transformations.
Professional Output Formats: Generate documentation in both Excel and Word formats, ensuring compatibility with your organization’s documentation standards and workflows.
Interactive AI Chat: Perhaps the most groundbreaking feature is AutoDoc’s intelligent chat system that allows you to have conversations about your Power BI model, asking specific questions about DAX logic, table relationships, or data transformations.
Multi-Language: You can create Power BI documentation in multiple languages, including English, Portuguese, and Spanish.
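To make the Intelligent File Processing capability above more concrete, here is a minimal, hypothetical sketch of the kind of extraction a tool like AutoDoc performs. A .pbit template is a ZIP archive whose DataModelSchema entry contains the model definition (tables, columns, measures) as JSON; the file name below is an assumption, and AutoDoc's actual implementation may differ.
Python
import json
import zipfile

# A .pbit template is a ZIP archive; DataModelSchema holds the model
# definition (tables, columns, measures) as UTF-16-encoded JSON.
# "SalesReport.pbit" is a hypothetical file name.
with zipfile.ZipFile("SalesReport.pbit") as pbit:
    raw = pbit.read("DataModelSchema")

model = json.loads(raw.decode("utf-16"))

for table in model["model"]["tables"]:
    print("Table:", table["name"])
    for measure in table.get("measures", []):
        print("  Measure:", measure["name"], "=", measure.get("expression"))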
How to Use AutoDoc
Using AutoDoc is remarkably straightforward, designed with busy data professionals in mind who need results quickly without a steep learning curve.
Step 1: Access AutoDoc. Visit https://autodoc.lawrence.eti.br/ to access the web-based version, or set up a local installation for enhanced security and control.
Step 2: Select Your AI Engine. Choose from the available AI models based on your specific requirements. Each model offers distinct strengths: GPT-4.1 for general use, Claude for complex reasoning, and Gemini for comprehensive analysis.
Step 3: Provide Your Power BI Model. You have two flexible options for getting your model into AutoDoc:
Option A: Direct Upload
Save your Power BI file as a .pbit template or export as .zip
Upload directly to the AutoDoc interface
The system automatically processes and analyzes your model
Option B: API Integration. For direct integration with Power BI Service:
Input your App ID in the sidebar
Provide your Tenant ID
Enter your Secret Value
AutoDoc connects directly to your Power BI workspace
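For context, the App ID, Tenant ID, and Secret Value correspond to a Microsoft Entra service principal with access to your workspaces. The sketch below is illustrative only (not AutoDoc's actual code) and assumes the msal and requests packages; it shows how such credentials are typically exchanged for a token and used against the Power BI REST API.
Python
import msal
import requests

# Hypothetical service principal credentials (the same values entered in the sidebar)
TENANT_ID = "your_tenant_id"
APP_ID = "your_app_id"
SECRET_VALUE = "your_secret_value"

app = msal.ConfidentialClientApplication(
    APP_ID,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
    client_credential=SECRET_VALUE,
)

# Acquire an app-only token for the Power BI REST API
token = app.acquire_token_for_client(
    scopes=["https://analysis.windows.net/powerbi/api/.default"]
)
headers = {"Authorization": f"Bearer {token['access_token']}"}

# List the workspaces the service principal can access
workspaces = requests.get(
    "https://api.powerbi.com/v1.0/myorg/groups", headers=headers
).json()
for ws in workspaces.get("value", []):
    print(ws["id"], ws["name"])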
Step 4: Review Interactive Preview. Before generating final documentation, AutoDoc provides an interactive visualization of your data model, allowing you to:
Verify the accuracy of the extracted information
Review table structures and relationships
Confirm DAX measures and calculations
Check data source connections
Step 5: Generate Documentation. Select your preferred output format (Excel or Word) and download professional documentation that includes:
Complete table inventory with column details
All DAX measures with expressions
Data source documentation
Relationship mappings
Power Query transformation logic
Step 6: Leverage AI Chat. After documentation generation, click the "Chat" button to access the intelligent assistant. Ask questions like:
“Explain the logic behind the ‘Total Sales’ measure.”
“What relationships exist between the Customer and Orders tables?”
“Which columns in the Product table are calculated?”
“Show me all measures that reference the Date table.”
Token Configuration in AutoDoc
Depending on the size of your Power BI report, AutoDoc allows you to adjust the maximum number of input and output tokens to optimize processing.
What are tokens? Tokens are the basic processing units of LLM models; they can be words, parts of words, or characters.
Input Tokens represent the amount of information the LLM model can process at once, including your report content and system instructions. This configuration allows you to:
Increase the value: Process more content simultaneously, reducing the number of required interactions
Decrease the value: Useful when the report is too large and exceeds model limits, forcing processing in smaller parts with more interactions.
Output Tokens: Define the maximum size of the response the model can generate. This configuration varies according to each LLM model’s capabilities and directly influences:
The length of the generated documentation
The completeness of the produced analyses
Processing time
Important: Each LLM model has specific token limitations. Refer to the documentation on this website to determine the exact limits and adjust these settings accordingly if necessary.
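As a rough way to relate these limits to your own model, you can count tokens locally before uploading. Here is a minimal sketch, assuming the tiktoken package and an OpenAI-style tokenizer (other providers tokenize differently, so treat the result as an estimate only):
Python
import tiktoken

# cl100k_base is the encoding used by recent OpenAI models; other providers differ
encoding = tiktoken.get_encoding("cl100k_base")

measure_definition = "Total Sales = SUMX(Orders, Orders[Quantity] * Orders[Unit Price])"
token_count = len(encoding.encode(measure_definition))
print(f"{token_count} input tokens for this measure definition")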
For organizations requiring enhanced security, compliance, or customization, AutoDoc offers complete local deployment capabilities. I created this open-source project, and you can find my GitHub repository here: https://github.com/LawrenceTeixeira/PBIAutoDoc
System Requirements
Operating System: Windows, macOS, or Linux
Python Version: 3.10 or higher
Network: Internet connection for AI model access
API Access: Valid API keys for chosen AI providers
# Install core requirements
pip install -r requirements.txt

# Install additional AI processing library
pip install --no-cache-dir chunkipy
4. Environment Variables Setup: Create a .env file in your project root:
Bash
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key

# Groq Configuration
GROQ_API_KEY=your_groq_api_key

# Azure OpenAI Configuration
AZURE_API_KEY=your_azure_api_key
AZURE_API_BASE=https://<your-alias>.openai.azure.com
AZURE_API_VERSION=2024-02-15-preview

# Google Gemini Configuration
GEMINI_API_KEY=your_gemini_api_key

# Anthropic Claude Configuration
ANTHROPIC_API_KEY=your_anthropic_api_key
5. Application Launch
Bash
# Standard launch
streamlit run app.py --server.fileWatcherType none

# Alternative for specific environments
python -X utf8 -m streamlit run app.py --server.fileWatcherType none
Cloud Deployment Option
For scalable cloud deployment, AutoDoc supports Fly.io hosting:
Bash
# Install Fly CLI
curl -L https://fly.io/install.sh | sh
export PATH=/home/codespace/.fly/bin:$PATH

# Authentication and deployment
flyctl auth login
flyctl launch
flyctl deploy
What Are the Benefits?
AutoDoc delivers transformative benefits that address every central pain point in Power BI documentation:
Dramatic Time Savings
What traditionally takes hours or days now happens in minutes. Data professionals report saving 15-20 hours per week on documentation tasks, allowing them to focus on analysis and insights rather than administrative work.
Unmatched Accuracy and Completeness
Human documentation inevitably misses details or becomes outdated. AutoDoc captures every table, column, measure, and relationship automatically, ensuring nothing is overlooked and documentation remains current.
Professional Consistency
Every documentation output follows the same professional format and standard, regardless of who generates it or when. This consistency is crucial for enterprise environments and compliance requirements.
Enhanced Knowledge Transfer
The AI chat feature transforms documentation from static text into an interactive knowledge base. Team members can ask specific questions and get detailed explanations, dramatically reducing onboarding time for new staff.
Compliance and Audit Support
For heavily regulated industries, AutoDoc provides the comprehensive documentation required for compliance audits, with detailed tracking of data lineage, transformations, and business logic.
Improved Collaboration
Non-technical stakeholders can better understand Power BI solutions through clear, comprehensive documentation. The chat feature allows business users to ask questions about data definitions and calculations without requiring technical expertise.
Cost Efficiency
By automating documentation processes, organizations reduce the human resources required for documentation maintenance while improving quality and coverage.
Conclusion
AutoDoc represents more than just another documentation tool; it's a paradigm shift that finally makes comprehensive Power BI documentation practical and sustainable. By combining cutting-edge AI technology with a deep understanding of Power BI architecture, AutoDoc solves the fundamental challenges that have made documentation a persistent pain point for data teams worldwide.
The tool’s multi-AI approach ensures flexibility and future-proofing, while its interactive chat capability transforms static documentation into a dynamic knowledge resource. Whether you’re a solo analyst struggling to document complex models or an enterprise data team managing hundreds of reports, AutoDoc adapts to your needs and scales with your organization.
The choice is clear: continue struggling with manual documentation processes that consume valuable time and often go incomplete, or embrace the AI-powered solution that makes comprehensive Power BI documentation effortless and automatic.
AutoDoc doesn't just solve Power BI's documentation problem; it eliminates it. The question isn't whether you can afford to implement AutoDoc; it's whether you can afford not to.
Should you have any questions or need assistance with AutoDoc, please don't hesitate to contact me using the provided link: https://lawrence.eti.br/contact/
Information is arguably an organization's most valuable asset in today's data-driven world. However, without proper management, this asset can quickly become a liability. Microsoft Fabric, a revolutionary unified analytics platform, integrates everything from data engineering and data science to data warehousing and business intelligence into a single, SaaS-based environment. It provides powerful tools to store, process, analyze, and visualize vast amounts of data. But with great power comes great responsibility. To maintain trust, ensure security, uphold data quality, and meet ever-increasing compliance demands, implementing a robust data governance framework within Fabric isn't just recommended; it's essential.
Effective data governance ensures that data remains accurate, secure, consistent, and usable throughout its entire lifecycle, aligning technical capabilities with strategic business goals and stringent regulatory requirements like GDPR, HIPAA, or CCPA. Within the Fabric ecosystem, this translates to leveraging its built-in governance features and its seamless integration with Microsoft Purview, Microsoft’s comprehensive data governance and compliance suite. The goal is to effectively manage and protect sensitive information while empowering users, from data engineers and analysts to business users and compliance officers, to confidently discover, access, and derive value from data within well-defined, secure guardrails.
A well-designed governance plan in Fabric strikes a critical balance between enabling user productivity and innovation and enforcing necessary controls for compliance and risk mitigation. It's about establishing clear policies, defining roles and responsibilities, and implementing consistent processes so that, as the adage goes, "the right people can take the right actions with the right data at the right time". This guide provides a practical, step-by-step approach to implementing such a framework within Microsoft Fabric, leveraging its native capabilities and Purview integration to build a governed, trustworthy data estate.
The Critical Importance of Data Governance
Data governance is more than just an IT buzzword or a compliance checkbox; it is a fundamental strategic imperative for any organization looking to leverage its data assets effectively and responsibly. The need for robust governance becomes even more pronounced in the context of a powerful, unified platform like Microsoft Fabric, which brings together diverse data workloads and user personas. Implementing strong data governance practices yields numerous critical benefits:
Ensuring Data Quality and Consistency: Governance establishes standards and processes for creation, maintenance, and usage, leading to more accurate, reliable, and consistent data across the organization. This is crucial for trustworthy analytics and informed decision-making. Poor data quality can lead to flawed insights, operational inefficiencies, and loss of credibility.
Enhancing Data Security and Protection: A core function of governance is to protect sensitive data from unauthorized access, breaches, or misuse. By defining access controls, implementing sensitivity labeling (using tools like Microsoft Purview Information Protection), and enforcing security policies, organizations can safeguard confidential information, protect intellectual property, and maintain customer privacy.
Meeting Regulatory Compliance Requirements: Organizations operate under a complex web of industry regulations and data privacy laws (such as GDPR, CCPA, HIPAA, SOX, etc.). Data governance provides the framework, controls, and audit trails necessary to demonstrate compliance, avoid hefty fines, and mitigate legal risks. Features like data lineage and auditing in Fabric, often powered by Purview, are essential.
Improving Data Discoverability and Usability: A well-governed data estate makes it easier for users to find the data they need. Features like the OneLake data hub, data catalogs, business glossaries, endorsements (certifying or promoting assets), and descriptive metadata help users quickly locate relevant, trustworthy data, fostering reuse and reducing redundant data preparation efforts.
Building Trust and Confidence: When users know that data is well-managed, secure, and accurate, they have greater confidence in the insights derived from it. This trust is foundational for fostering a data-driven culture where decisions are based on reliable evidence.
Optimizing Operational Efficiency: Governance helps streamline data-related processes, reduce data duplication, clarify ownership, and improve team collaboration. This leads to increased efficiency, reduced costs for managing poor-quality or redundant data, and faster time-to-insight.
Enabling Scalability and Innovation: While governance involves controls, it also provides the necessary structure to manage data effectively as volumes and complexity grow. A solid governance foundation allows organizations to innovate confidently, knowing their data practices are sound and scalable.
Data governance transforms data from a potential risk into a reliable, strategic asset, enabling organizations to maximize their value while minimizing associated risks within the Microsoft Fabric environment.
An Overview of Microsoft Fabric
Understanding the platform itself is helpful before diving into the specifics of governance implementation. Microsoft Fabric represents a significant evolution in the analytics landscape, offering an end-to-end, unified platform delivered as a Software-as-a-Service (SaaS) solution. It aims to simplify analytics for organizations by combining disparate data tools and services into a single, cohesive environment built around a central data lake called OneLake.
Fabric integrates various data and analytics workloads, often referred to as “experiences,” which traditionally required separate, usually complex, integrations:
Data Factory: Provides data integration capabilities for ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) processes, enabling data movement and transformation at scale.
Synapse Data Engineering: A Spark-based platform for large-scale data transformation and preparation, used primarily through notebooks.
Synapse Data Science: Provides an end-to-end workflow for data scientists to build, deploy, and manage machine learning models.
Synapse Data Warehousing: Delivers a next-generation SQL engine for traditional data warehousing workloads, offering high performance over open data formats.
Synapse Real-Time Analytics: This technology enables the real-time analysis of data streaming from various sources, such as IoT devices and logs.
Power BI: The well-established business intelligence and visualization service, fully integrated for reporting and analytics.
Data Activator: A no-code experience for monitoring data and triggering actions based on detected patterns or conditions.
Shortcuts allow your organization to easily share data between users and applications without unnecessarily moving and duplicating information. When teams work independently in separate workspaces, shortcuts enable you to combine data across different business groups and domains into a virtual data product to fit a user's specific needs.
A shortcut is a reference to data stored in other file locations. These file locations can be within the same workspace or across different workspaces, within OneLake or external to OneLake in ADLS, S3, or Dataverse, with more target locations coming soon. No matter the location, shortcuts make files and folders appear as if they are stored locally. For more information on how to use shortcuts, see OneLake shortcuts.
Underpinning all these experiences is OneLake, Fabric's built-in, tenant-wide data lake. OneLake eliminates data silos by providing a single, unified storage system for all data within Fabric, regardless of which experience created or uses it. It is built on Azure Data Lake Storage Gen2 but adds shortcuts (allowing data to be referenced without moving or duplicating it) and a unified namespace, simplifying data management and access.
This unified architecture has profound implications for governance. By centralizing data storage (OneLake) and providing a familiar administrative interface (Fabric Admin Portal), Fabric facilitates the application of consistent governance policies, security controls, and monitoring across the entire analytics lifecycle. Features like sensitivity labels and lineage can often propagate automatically across different Fabric items, simplifying the task of governing a complex data estate. Understanding this integrated nature is key to effectively implementing governance within the platform.
Understanding Microsoft Purview: The Governance Foundation
While Microsoft Fabric provides the unified analytics platform, Microsoft Purview is the overarching data governance, risk, and compliance solution that integrates deeply with Fabric to manage and protect the entire data estate. Understanding Purview’s role is crucial for implementing effective governance in Fabric.
Microsoft Purview is a family of solutions designed to help organizations govern, protect, and manage data across their entire landscape, including Microsoft 365, on-premises systems, multi-cloud environments, and SaaS applications like Fabric. Its key capabilities relevant to Fabric governance include:
Unified Data Catalog: Purview automatically discovers and catalogs Fabric items (like lakehouses, warehouses, datasets, reports) alongside other data assets. It creates an up-to-date map of the data estate, enabling users to easily find and understand data through search, browsing, and business glossary terms.
Data Classification and Sensitivity Labels: Through integration with Microsoft Purview Information Protection, Purview allows organizations to define sensitivity labels (e.g., Confidential, PII) and apply them consistently across Fabric items. This classification helps identify sensitive data and drives protection policies.
End-to-End Data Lineage: Purview provides visualization of data lineage, showing how data flows and transforms from its source through various Fabric processes (e.g., Data Factory pipelines, notebooks) down to Power BI reports. This is vital for impact analysis, troubleshooting, and demonstrating compliance.
Data Loss Prevention (DLP): Purview DLP policies can be configured (currently primarily for Power BI semantic models within Fabric) to detect sensitive information based on classifications or patterns (like credit card numbers) and prevent its unauthorized sharing or exfiltration, providing alerts and policy tips.
Auditing: All user and administrative activities within Fabric are logged and made available through Microsoft Purview Audit, providing a comprehensive trail for security monitoring and compliance investigations.
Purview Hub in Fabric: This centralized page within the Fabric experience provides administrators and governance stakeholders with insights into their Fabric data estate, including sensitivity labeling coverage, endorsement status, and a gateway to the broader Purview governance portal.
Purview is the central governance plane that overlays Fabric (and other data sources), providing the tools to define policies, classify data, track lineage, enforce protection, and consistently monitor activities. The seamless integration ensures that as data moves and transforms within Fabric, the governance context (like sensitivity labels and lineage) is maintained, enabling organizations to build a truly governed and trustworthy analytics environment.
Step-by-Step Process for Implementing Data Governance in Microsoft Fabric
Implementing data governance in Microsoft Fabric is a phased process that involves defining policies, configuring technical controls, assigning responsibilities, and establishing ongoing monitoring. Here's a practical step-by-step guide:
Step 1: Define Your Governance Policies and Framework
Before configuring any tools, establish the foundation: your governance framework. This involves defining the rules, standards, and responsibilities that will guide data handling within Fabric.
Identify Stakeholders and Requirements: Assemble a cross-functional team including representatives from IT, data management, legal, compliance, and key business units. Collaboratively identify all applicable external regulations (e.g., GDPR, HIPAA, or CCPA) and internal business requirements (e.g., data quality standards, retention policies, ethical use guidelines). Understanding these requirements is crucial for tailoring your policies.
Develop Data Classification Policies: Define clear data sensitivity levels (e.g., Public, Internal, Confidential, Highly Restricted). Map these levels to Microsoft Purview Information Protection sensitivity labels. Establish clear policies detailing how data in each classification level must be handled regarding access, sharing, encryption, retention, and disposal. For example, mandate that all data classified as "Highly Restricted" must be encrypted, with access restricted to specific roles. https://learn.microsoft.com/en-us/purview/sensitivity-labels
Configure Tenant Settings via Admin Portal: Fabric administrators should configure tenant-wide governance settings in the Fabric Admin Portal. This includes defining who can create workspaces, setting default sharing behaviors, enabling auditing, configuring capacity settings, and potentially restricting specific Fabric experiences. Many settings can be delegated to domain or capacity admins, where appropriate, for more granular control. Consider licensing requirements for advanced Purview features like automated labeling or DLP. https://learn.microsoft.com/en-us/fabric/admin/about-tenant-settings
Document and Communicate: Document all governance policies, standards, and procedures. Make this documentation easily accessible to all Fabric users. Communicate the policies effectively, explaining their rationale and clarifying user responsibilities. Assign clear accountability for policy enforcement, often involving data stewards, data owners, and workspace administrators.
Step 2: Establish Roles and Access Controls (RBAC)
With policies defined, implement Role-Based Access Control (RBAC) to enforce them.
Utilize Workspace Roles: Assign users or (preferably) security groups to Fabric workspace roles (Admin, Member, Contributor, Viewer) based on the principle of least privilege. Understand the permissions associated with each role to ensure users only have the access necessary for their jobs. https://learn.microsoft.com/en-us/fabric/fundamentals/roles-workspaces
Leverage Security Groups: Manage access primarily through Microsoft Entra ID (formerly Azure AD) security groups rather than individual user assignments. This simplifies administration as team memberships change.
Assign Admin Roles: Carefully assign higher-level administrative roles: Fabric Administrator (tenant-wide), Domain Administrator (for specific business areas), and Capacity Administrator (for managing compute resources). Delegate responsibilities where appropriate to distribute the governance workload. https://learn.microsoft.com/en-us/fabric/admin/roles
Establish Access Review Processes: Implement procedures for requesting, approving, and periodically reviewing access permissions, especially for sensitive data or privileged roles. Maintain logs of approvals for audit purposes.
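Role assignments can also be scripted for repeatability. Below is a minimal, illustrative sketch (assuming the requests package, an access token for the Power BI REST API obtained with a service principal, and hypothetical workspace and group IDs) that adds an Entra security group to a workspace with the Viewer role:
Python
import requests

# Hypothetical values: an admin-consented access token and target IDs
ACCESS_TOKEN = "your_power_bi_access_token"
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"
SECURITY_GROUP_ID = "11111111-1111-1111-1111-111111111111"

headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# Add the security group to the workspace with the least-privileged Viewer role
response = requests.post(
    f"https://api.powerbi.com/v1.0/myorg/groups/{WORKSPACE_ID}/users",
    headers=headers,
    json={
        "identifier": SECURITY_GROUP_ID,
        "principalType": "Group",
        "groupUserAccessRight": "Viewer",
    },
)
response.raise_for_status()
print("Viewer access granted to the security group")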
Step 3: Configure Workspaces and Domains
Organize your Fabric environment logically to support governance.
Structure Domains: Group workspaces into logical domains, typically aligned with business units or subject areas (e.g., Finance, Marketing, Product Analytics). This facilitates delegated administration and helps users discover relevant data. https://learn.microsoft.com/en-us/fabric/governance/domains
Organize Workspaces: Within domains, organize workspaces based on purpose (e.g., project, team) or environment (Development, Test, Production). Use clear naming conventions and descriptions. Assign workspaces to the appropriate domain. https://learn.microsoft.com/en-us/fabric/fundamentals/workspaces
Apply Workspace Settings: Configure settings within each workspace, such as contact lists, license modes (Pro, PPU, Fabric capacity), and connections to resources like Git for version control, aligning them with your governance policies.
Step 4: Implement Data Protection and Security Measures
Actively protect your data assets using built-in and integrated tools.
Apply Sensitivity Labels: Implement the data classification policy by applying Microsoft Purview Information Protection sensitivity labels to Fabric items (datasets, reports, lakehouses, etc.). Use a combination of manual labeling by users, default labeling on workspaces or items, and automated labeling based on sensitive information types detected by Purview scanners. Ensure label inheritance policies are configured appropriately. https://learn.microsoft.com/en-us/power-bi/enterprise/service-security-enable-data-sensitivity-labels
Configure Data Loss Prevention (DLP) Policies: Define and enable Microsoft Purview DLP policies specifically for Power BI (and potentially other Fabric endpoints as capabilities expand) to detect and prevent the inappropriate sharing or exfiltration of sensitive data identified by sensitivity labels. (Note: Requires specific Purview licensing.) https://learn.microsoft.com/en-us/fabric/governance/data-loss-prevention-configure
Leverage Encryption: Understand and utilize Fabric’s encryption capabilities, including encryption at rest (often managed by the platform) and potentially customer-managed keys (CMK) for enhanced control over encryption if required. https://learn.microsoft.com/en-us/fabric/security/security-scenario
Step 5: Enable Monitoring and Auditing
Visibility into data usage and governance activities is crucial.
Enable and Review Audit Logs: Ensure Fabric auditing is enabled and integrated with the Microsoft Purview compliance portal. Regularly review audit logs to track user activities, access patterns, policy changes, and potential security incidents. https://learn.microsoft.com/en-us/fabric/admin/track-user-activities
Implement Endorsement: Establish a straightforward process for promoting and certifying high-quality, reliable data assets (datasets, dataflows, reports). Clearly define the criteria for certification and authorize specific reviewers. This builds user trust in endorsed assets. https://learn.microsoft.com/en-us/fabric/governance/endorsement-overview
Leverage Lineage and Impact Analysis: Use Fabric’s lineage view to understand data origins, transformations, and dependencies. Before changing upstream items, perform an impact analysis to understand potential effects on downstream reports or processes. https://learn.microsoft.com/en-us/fabric/governance/lineage
Gather Feedback: Solicit feedback from users and stakeholders on the governance processes and tools.
Adapt and Update: Update policies and configurations based on audit findings, user feedback, changing regulations, and evolving business needs. Stay informed about new Fabric and Purview governance features.
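The audit trail referenced in Step 5 can also be retrieved programmatically for retention or integration with a SIEM. Here is a minimal, illustrative sketch (assuming the requests package and an access token with Power BI/Fabric admin permissions) that pages through one day of activity events:
Python
import requests

# Hypothetical admin-scoped access token
ACCESS_TOKEN = "your_admin_access_token"
headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}

# The activity events API expects single-quoted UTC timestamps within the same day
url = (
    "https://api.powerbi.com/v1.0/myorg/admin/activityevents"
    "?startDateTime='2025-01-15T00:00:00'&endDateTime='2025-01-15T23:59:59'"
)

events = []
while url:
    page = requests.get(url, headers=headers).json()
    events.extend(page.get("activityEventEntities", []))
    url = page.get("continuationUri")  # None when no more pages remain

print(f"Retrieved {len(events)} audit events")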
By following these steps, organizations can establish a comprehensive and practical data governance framework within Microsoft Fabric, enabling them to harness the full power of the platform while maintaining control, security, and compliance.
Real-World Examples: Data Governance in Action
The principles and steps outlined above are not just theoretical; organizations are actively implementing robust data governance frameworks using Microsoft Fabric and Purview to overcome challenges and drive value. Let’s look at a couple of examples:
Microsoft itself faced significant hurdles with its vast and complex data estate. Data was siloed across various business units and managed inconsistently, making it difficult to gain a unified enterprise view. Governance was often perceived as a bottleneck, hindering the pace of digital transformation. Microsoft embarked on its data transformation journey, leveraging its tools to address this.
Their strategy involved building an enterprise data platform centered around Microsoft Fabric as the unifying analytics foundation and Microsoft Purview for governance. Fabric helped break down silos by providing a common platform (including OneLake) for data integration and analytics across diverse sources. Purview was then layered on top to enable responsible data democratization. This meant implementing controls like a shared data catalog and consistent policies, not to restrict access arbitrarily, but to enable broader, secure access to trustworthy data. A key cultural shift was viewing governance as an accelerator for transformation, facilitated by the unified data strategy and strong leadership alignment. The outcome is a more agile, regulated, and business-focused data environment that fuels faster decision-making and innovation.
A leading bank operating in a highly regulated industry revolutionized its data governance with Microsoft Purview. While specific challenges aren’t detailed in the summary, typical banking concerns include operational efficiency, stringent compliance requirements (like GDPR), data security, and preventing sensitive data loss.
By implementing Purview, the bank achieved significant improvements. Operationally, automated data discovery and a centralized view allowed business users to find information faster and reduced manual effort in reporting. From a compliance perspective, Purview provided centralized metrics for monitoring the compliance posture and automated processes for classifying and tagging data according to regulations, strengthening overall security. Furthermore, implementing Data Loss Prevention (DLP) rules based on data sensitivity helped safeguard critical information and prevent unauthorized access or sharing. Purview acted as a unified platform, enhancing efficiency, visibility, security, and control over the bank’s data assets.
These examples illustrate how organizations, facing everyday challenges like data silos, compliance pressures, and the need for agility, are successfully using Microsoft Fabric and Purview to establish effective data governance. They highlight the importance of a unified data strategy, the role of tools in automating and centralizing controls, and the cultural shift towards viewing governance as an enabler of business value.
Conclusion
Microsoft Fabric offers a robust, unified platform for end-to-end analytics, but realizing its full potential requires a deliberate and comprehensive approach to data governance. As we’ve explored, implementing governance in Fabric is not merely about restricting access; it’s about establishing a framework that ensures data quality, security, compliance, and usability, fostering trust and enabling confident, data-driven decision-making across the organization.
The real-world examples, from Microsoft’s internal transformation to implementations in regulated industries like finance, demonstrate that these are not just theoretical concepts. Organizations are actively leveraging Fabric’s unified foundation and Purview’s comprehensive governance capabilities to overcome tangible challenges like data silos, inconsistent management, compliance burdens, and operational inefficiencies.
By integrating Fabric's built-in features (such as the Admin Portal, domains, workspaces, RBAC, endorsement, and lineage) with the advanced capabilities of Microsoft Purview (including Information Protection sensitivity labels, Data Loss Prevention, auditing, and the unified data catalog), organizations can create a robust governance posture tailored to their specific needs.
The outlined step-by-step process provides a roadmap, but the journey requires more than technical implementation. Success hinges on several key factors, reinforced by real-world experience:
Key Recommendations for Success:
Strategic Alignment and Collaboration: As seen in Microsoft’s case, define clear governance objectives that are aligned with business goals before configuring tools. Data governance requires a cultural shift and strong leadership alignment. It’s a team effort involving IT, data, legal, compliance, and business units.
Leverage the Unified Platform (Fabric + Purview): Treat Fabric and Purview as an integrated solution. Use Fabric to unify the data estate and Purview to apply consistent governance controls across it, enabling responsible democratization and breaking down silos.
Prioritize Automation for Efficiency and Consistency: Automate governance tasks like sensitivity labeling, policy enforcement (DLP), and monitoring wherever possible. As the banking case study demonstrated, this reduces manual effort, ensures consistency, improves responsiveness, and boosts operational efficiency.
Focus on User Empowerment and Education: Balance control with usability. Provide clear documentation, training, and tools (like the OneLake Data Hub and Purview catalog) to help users understand policies, find trustworthy data, and comply with requirements, turning governance into an accelerator, not a blocker.
Implement Incrementally and Iterate: Data governance is an ongoing journey. Start with a pilot or focus on critical assets first. Monitor effectiveness, gather feedback, and continuously refine your approach based on evolving needs, regulations, and platform capabilities.
By taking a structured, collaborative, and tool-aware approach, informed by others’ successes, organizations can build a foundation of trust and control within Microsoft Fabric, transforming governance from a perceived burden into a strategic enabler that unlocks the actual value of their data.
Should you have any questions or need assistance about Microsoft Fabric or Microsoft Purview, please don’t hesitate to contact me using the provided link: https://lawrence.eti.br/contact/
In the modern digital era, the importance of streamlined data preparation cannot be emphasized enough. For data scientists and analysts, a large portion of time is dedicated to data cleansing and preparation, often termed 'wrangling.' Microsoft's introduction of Data Wrangler in its Fabric suite seems like an answer to this age-old challenge. It promises the intuitiveness of Power Query with the flexibility of Python code output. Dive in to uncover the magic of this new tool.
Data preparation is a time-consuming and error-prone task. It often involves cleaning, transforming, and merging data from multiple sources. This can be a daunting task, even for experienced data scientists.
What is Data Wrangler?
Data Wrangler is a state-of-the-art tool Microsoft offers in its Fabric suite explicitly designed for data professionals. At its core, it aims to simplify the data preparation process by automating tedious tasks. Much like Power Query, it offers a user-friendly interface, but what sets it apart is that it can generate Python code as an output. As users interact with the GUI, Python code snippets are generated behind the scenes, making integrating various data science workflows easier.
Advantages of Data Wrangler
User-Friendly Interface: Offers an intuitive GUI for those not comfortable with coding.
Python Code Output: Generates Python code in real-time, allowing flexibility and easy integration.
Time-Saving: Reduces the time spent on data preparation dramatically.
Replicability: Since Python code is generated, it ensures replicable data processing steps.
Integration with Fabric Suite: Can be effortlessly integrated with other tools within the Microsoft Fabric suite.
No-code to Low-code Transition: Ideal for those wanting to transition from a no-code environment to a more code-centric one.
How to use Data Wrangler?
Click on Data Science inside the Power BI Service.
Select the Notebook button.
After uploading the CSV file to the Lakehouse, insert the following code:
Python
import pandas as pd

# Read a CSV into a Pandas DataFrame from e.g. a public blob store
df = pd.read_csv("/lakehouse/default/Files/Top_1000_Companies_Dataset.csv")
Click Launch Data Wrangler and then select the data frame "df".
On this screen, you can perform all the transformations you need.
In the end, code like the following will be generated.
Python
# Code generated by Data Wrangler for pandas DataFrame

def clean_data(df):
    # Drop columns: 'company_name', 'url' and 6 other columns
    df = df.drop(columns=['company_name', 'url', 'city', 'state', 'country', 'employees', 'linkedin_url', 'founded'])
    # Drop columns: 'GrowjoRanking', 'Previous Ranking' and 10 other columns
    df = df.drop(columns=['GrowjoRanking', 'Previous Ranking', 'job_openings', 'keywords', 'LeadInvestors', 'Accelerator', 'valuation', 'btype', 'total_funding', 'product_url', 'growth_percentage', 'contact_info'])
    # Drop column: 'indeed_url'
    df = df.drop(columns=['indeed_url'])
    # Performed 1 aggregation grouped on column: 'Industry'
    df = df.groupby(['Industry']).agg(estimated_revenues_sum=('estimated_revenues', 'sum')).reset_index()
    # Sort by column: 'estimated_revenues_sum' (descending)
    df = df.sort_values(['estimated_revenues_sum'], ascending=[False])
    return df

df_clean = clean_data(df.copy())
df_clean.head()
After that, you can create or add to a pipeline or schedule a moment to execute this transformation automatically.
Data Wrangler Extension for Visual Studio Code
Data Wrangler is a code-centric data cleaning tool integrated into VS Code and Jupyter Notebooks. Data Wrangler aims to increase the productivity of data scientists doing data cleaning by providing a rich user interface that automatically generates Pandas code and shows insightful column statistics and visualizations.
This document will cover how to:
Install and setup Data Wrangler
Launch Data Wrangler from a notebook
Use Data Wrangler to explore your data
Perform operations on your data
Edit and export code for data wrangling to a notebook
Troubleshooting and providing feedback
Setting up your environment
If you have not already done so, install Python. IMPORTANT: Data Wrangler only supports Python version 3.8 or higher.
Install the Data Wrangler extension for VS Code from the Visual Studio Marketplace. For additional details on installing extensions, see Extension Marketplace. The Data Wrangler extension is named Data Wrangler, and Microsoft publishes it.
When you launch Data Wrangler for the first time, it will ask you which Python kernel you would like to connect to. It will also check your machine and environment to see if any required Python packages are installed (e.g., Pandas).
Here is a list of the required versions for Python and Python packages, along with whether they are automatically installed by Data Wrangler:
Name      Minimum required version   Automatically installed
Python    3.8                        No
pandas    0.25.2                     Yes
regex*    2020.11.13                 Yes
* We use the open-source regex package to be able to use Unicode properties (for example, /\p{Lowercase_Letter}/), which aren’t supported by Python’s built-in regex module (re). Unicode properties make it easier and cleaner to support foreign characters in regular expressions.
If they are not found in your environment, Data Wrangler will attempt to install them for you via pip. If Data Wrangler cannot install the dependencies, the easiest workaround is to run the pip install manually and then relaunch Data Wrangler. These dependencies are required so that Data Wrangler can generate Python and Pandas code.
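As a quick illustration of the difference, the \p{...} Unicode property classes work with the regex package but not with the built-in re module:
Python
import regex

text = "Café Ürün niño"

# \p{Lowercase_Letter} matches lowercase letters in any script, including é, ü, and ñ
print(regex.findall(r"\p{Lowercase_Letter}+", text))  # ['afé', 'rün', 'niño']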
Connecting to a Python kernel
There are currently two ways to connect to a Python kernel, as shown in the quick pick below.
1. Connect using a local Python interpreter
If this option is selected, the kernel connection is created using the Jupyter and Python extensions. We recommend this option for a simple setup and a quick way to start with Data Wrangler.
2. Connect using Jupyter URL and token
A kernel connection is created using JupyterLab APIs if this option is selected. Note that this option has performance benefits since it bypasses some initialization and kernel discovery processes. However, it will also require separate Jupyter Notebook server user management. We recommend this option generally in two cases: 1) if there are blocking issues in the first method and 2) for power users who would like to reduce the cold-start time of Data Wrangler.
To set up a Jupyter Notebook server and use it with this option, follow the steps below:
Install Jupyter. We recommend installing Anaconda, which comes with Jupyter installed. Alternatively, follow the official instructions to install it.
In the appropriate environment (e.g., in an Anaconda prompt if Anaconda is used), launch the server with the following command (replace the jupyter token with your secure token): jupyter notebook --no-browser --NotebookApp.token='<your-jupyter-token>'
In Data Wrangler, connect using the address of the spawned server. E.g., http://localhost:8888, and pass in the token used in the previous step. Once configured, this information is cached locally and can automatically be reused for future connections.
Launching Data Wrangler
Once Data Wrangler has been successfully installed, there are 2 ways to launch it in VS Code.
Launching Data Wrangler from a Jupyter Notebook
If you are in a Jupyter Notebook working with Pandas data frames, you'll now see a "Launch Data Wrangler" button appear after running specific operations on your data frame, such as df.head(). Clicking the button will open a new tab in VS Code with the Data Wrangler interface in a sandboxed environment.
Important note: We currently only accept the following formats for launching:
df
df.head()
df.tail()
Where df is the name of the data frame variable. The code above should appear at the end of a cell without any comments or other code after it.
Launching Data Wrangler directly from a CSV file
You can also launch Data Wrangler directly from a local CSV file. To do so, open any VS Code folder with the CSV dataset you'd like to explore. In the File Explorer panel, right-click the CSV dataset and click "Open in Data Wrangler."
Using Data Wrangler
The Data Wrangler interface is divided into 6 components, described below.
The Quick Insights header lets you quickly see valuable information about each column. Depending on the column’s datatype, Quick Insights will show the distribution of the data, the frequency of data points, and missing and unique values.
The Data Grid gives you a scrollable pane to view your entire dataset. Additionally, when selecting an operation to perform, a preview will be illustrated in the data grid, highlighting the modified columns.
The Operations Panel is where you can search through Data Wrangler's built-in data operations. The operations are organized by their top-level category.
The Summary Panel shows detailed summary statistics for your dataset or a specific column if one is selected. Depending on the data type, it will show information such as min, max values, datatype of the column, skew, and more.
The Operation History Panel shows a human-readable list of all the operations previously applied in the current Data Wrangling session. It enables users to undo specific operations or edit the most recent operation. Selecting a step will highlight the data grid changes and show the generated code associated with that operation.
The Code Preview section will show the Python and Pandas code that Data Wrangler has generated when an operation is selected. It will remain blank when no operation is selected. The code can even be edited by the user, and the data grid will highlight the effect on the data.
Example: Filtering a column
Let's go through a simple example using Data Wrangler with the Titanic dataset to filter adult passengers on the ship.
We'll start by looking at the quick insights of the Age column, and we'll notice the distribution of the ages and that the minimum age is 0.42. For more information, we can glance at the Summary panel to see that the datatype is a float, along with additional statistics such as the passengers' mean and median age.
To filter for only adult passengers, we can go to the Operation Panel and search for the keyword "Filter" to find the Filter operation. (You can also expand the "Sort and filter" category to find it.)
Once we select an operation, we are brought into the Operation Preview state, where parameters can be modified to see how they affect the underlying dataset before applying the operation. In this example, we want to filter the dataset only to include adults, so we'll want to filter the Age column to only include values greater than or equal to 18.
Once the parameters are entered in the operation panel, we can see a preview of what will happen to the data. We'll notice that the minimum value in age is now 18 in the Quick Insights, along with a visual preview of the rows that are being removed, highlighted in red. Finally, we'll also notice the Code Preview section automatically shows the code that Data Wrangler produced to execute this Filter operation. We can edit this code by changing the filtered age to 21, and the data grid will automatically update accordingly.
After confirming that the operation has the intended effect, we can click Apply.
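The code Data Wrangler emits for this step is plain Pandas, roughly along the lines of the sketch below (the exact generated function may be wrapped differently):
Python
# Filter rows based on column: 'Age' (keep adult passengers only)
df = df[df['Age'] >= 18]

# Editing the generated code in the Code Preview, e.g. changing 18 to 21,
# updates the data grid preview accordingly
# df = df[df['Age'] >= 21]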
Editing and exporting code
Each step of the generated code can be modified. Changes to the data will be highlighted in the grid view as you make changes.
Once you're done with your data cleaning steps in Data Wrangler, there are 3 ways to export your cleaned dataset from Data Wrangler.
Export code back to Notebook and exit: This creates a new cell in your Jupyter Notebook with all the data cleaning code you generated packaged into a clean Python function.
Export data as CSV: This saves the cleaned dataset as a new CSV file onto your machine.
Copy code to clipboard: This copies all the code generated by Data Wrangler for the data cleaning operations.
Note: If you launched Data Wrangler directly from a CSV, the first export option will be to export the code into a new Jupyter Notebook.
Data Wrangler operations
These are the Data Wrangler operations currently supported in the initial launch of Data Wrangler (with many more to be added soon).
Operation – Description
Sort values – Sort column(s) ascending or descending
Filter – Filter rows based on one or more conditions
Calculate text length – Create a new column with values equal to the length of each string value in a text column
One-hot encode – Split categorical data into a new column for each category
Multi-label binarizer – Split categorical data into a new column for each category using a delimiter
Create column from formula – Create a column using a custom Python formula
Change column type – Change the data type of a column
Drop column – Delete one or more columns
Select column – Choose one or more columns to keep and delete the rest
Rename column – Rename one or more columns
Drop missing values – Remove rows with missing values
Drop duplicate rows – Drop all rows that have duplicate values in one or more columns
Fill missing values – Replace cells with missing values with a new value
Find and replace – Find cells with an exact matching value and replace them with a new value
Group by column and aggregate – Group by columns and aggregate results
Strip whitespace – Remove whitespace from the beginning and end of the text
Split text – Split a column into several columns based on a user-defined delimiter
Convert text to capital case – Capitalize the first character of a string, with the option to apply to all words
Convert text to lowercase – Convert text to lowercase
Convert text to uppercase – Convert text to UPPERCASE
String transform by example – Automatically perform string transformations when a pattern is detected from the examples you provide
DateTime formatting by example – Automatically perform DateTime formatting when a pattern is detected from the examples you provide
New column by example – Automatically create a column when a pattern is detected from the examples you provide
Scale min/max values – Scale a numerical column between a minimum and maximum value
Custom operation – Automatically create a new column based on examples and the derivation of existing column(s)
Limitations
Data Wrangler currently supports only Pandas DataFrames. Support for Spark DataFrames is in progress. Data Wrangler’s display works better on large monitors, although different interface portions can be minimized or hidden to accommodate smaller screens.
Conclusion
Data Wrangler in Microsoft Fabric is undeniably a game-changer in data preparation. It combines the best of both worlds by offering the simplicity of Power Query with the robustness and flexibility of Python. As data continues to grow in importance, tools like Data Wrangler that simplify and expedite the data preparation process will be indispensable for organizations aiming to stay ahead.
Microsoft Fabric, launched at the Microsoft Build event in May 2023, is an end-to-end data and analytics platform that combines Microsoft's OneLake data lake, Power BI, Azure Synapse, and Azure Data Factory into a unified software as a service (SaaS) platform. It's a one-stop solution designed to serve various data professionals, including data engineers, data warehousing professionals, data scientists, data analysts, and business users, enabling them to collaborate effectively within the platform to foster a healthy data culture across their organizations.
What are the Microsoft Fabric key features?
Data Factory – Data Factory in Microsoft Fabric combines the simplicity of Power Query with the scale of Azure Data Factory. It provides over 200 native connectors for linking data from on-premises and cloud-based sources. Data Factory enables the scheduling and orchestration of notebooks and Spark jobs.
Data Engineering – Leveraging the extensive capabilities of Spark, data engineering in Microsoft Fabric provides premier authoring experiences and facilitates large-scale data transformations. It plays a crucial role in democratizing data through the lakehouse model. Moreover, integration with Data Factory allows notebooks and Spark jobs to be scheduled and orchestrated.
Data Science – The data science capability in Microsoft Fabric aids in building, deploying, and operationalizing machine learning models within the Fabric framework. It interacts with Azure Machine Learning for built-in experiment tracking and model registry, empowering data scientists to enhance organizational data with predictions that business analysts can incorporate into their BI reports, thereby transitioning from descriptive to predictive insights.
Data Warehouse – The data warehousing component of Microsoft Fabric offers top-tier SQL performance and scalability. It features a full separation of computing and storage for independent scaling and native data storage in the open Delta Lake format.
Real-Time Analytics – Observational data, acquired from diverse sources like apps, IoT devices, human interactions, and more, represents the fastest-growing data category. This semi-structured, high-volume data, often in JSON or text format with varying schemas, presents challenges for conventional data warehousing platforms. However, Microsoft Fabric's Real-Time Analytics offers a superior solution for analyzing such data.
Power BI – Recognized as a leading Business Intelligence platform worldwide, Power BI in Microsoft Fabric enables business owners to access all Fabric data swiftly and intuitively for data-driven decision-making.
What are the Advantages of Microsoft Fabric?
Unified Platform: Microsoft Fabric provides a unified platform for different data analytics workloads such as data integration, engineering, warehousing, data science, real-time analytics, and business intelligence. This can foster a well-functioning data culture across the organization, as data engineers, warehousing professionals, data scientists, data analysts, and business users can collaborate within Fabric.
Multi-cloud Support: Fabric is designed with a multi-cloud approach in mind, with support for data in Amazon S3 and (soon) Google Cloud Platform. This means that users are not restricted to using data only from Microsoft's ecosystem, providing flexibility.
Accessibility: Microsoft Fabric is currently available in public preview, and anyone can try the service without providing their credit card information. Starting from July 1, Fabric will be enabled for all Power BI tenants.
AI Integration: The private preview of Copilot in Power BI will combine advanced generative AI with data, enabling users to simply describe the insights they need or ask a question about their data, and Copilot will analyze and pull the correct data into a report, turning data into actionable insights instantly.
Microsoft Fabric – Licensing and Pricing
Microsoft Fabric capacities are available for purchase in the Azure portal. These capacities provide the compute resources for all the experiences in Fabric from the Data Factory to ingest and transform to Data Engineering, Data Science, Data Warehouse, Real-Time Analytics, and all the way to Power BI for data visualization. A single capacity can power all workloads concurrently and does not need to be pre-allocated across the workloads. Moreover, a single capacity can be shared among multiple users and projects, without any limitations on the number of workspaces or creators that can utilize it.
To gain access to Microsoft Fabric, you have three options:
Leverage your existing Power BI Premium subscription by turning on the Fabric preview switch. All Power BI Premium capacities can instantly power all the Fabric workloads with no additional action required. In other words, you can use your existing Power BI Premium resources to run the full range of data and analytics tasks that Microsoft Fabric can handle.
Start a Fabric trial if your tenant supports trials. If you’re not sure about committing to Microsoft Fabric yet, you can start a trial if your tenant (an instance of Azure Active Directory) supports it. A trial allows you to test the service before deciding to purchase. During the trial period, you can explore the full capabilities of Microsoft Fabric, such as data ingestion, data transformation, data engineering, data science, data warehouse operations, real-time analytics, and data visualization with Power BI.
Purchase a Fabric pay-as-you-go capacity from the Azure portal. If you decide that Microsoft Fabric suits your needs and you don't have a Power BI Premium subscription, you can directly purchase a Fabric capacity on a pay-as-you-go basis from the Azure portal. The pay-as-you-go model is flexible because it allows you to pay for only the compute and storage resources you use. Microsoft Fabric capacities come in different sizes, from F2 to F2048, representing 2 to 2048 Capacity Units (CU). Your bill will be determined by the amount of computing you provision (i.e., the size of the capacity you choose) and the amount of storage you use in OneLake, the data lake built into Microsoft Fabric. This model also allows you to easily scale your capacities up and down to adjust their computing power, and even pause your capacities when not in use to save on your bills.
Microsoft Fabric is a unified product for all your data and analytics workloads. Rather than provisioning and managing separate compute for each workload, with Fabric, your bill is determined by two variables: the amount of compute you provision and the amount of storage you use.
These are the capacities you can buy in the Azure portal:
Check out this video from Guy in a Cube, which breaks down the details on pricing and licensing.
How to activate the Microsoft Fabric Trial version?
Step 1
Log in to Microsoft Power BI with your developer account.
You will observe that, aside from the OneLake icon at the top left, everything looks normal if you are familiar with the Power BI Service.
Step 2
Enable Microsoft Fabric for your Tenant
Your screen will look like this:
So far, we've only enabled Microsoft Fabric at the tenant level. This doesn't give full access to Fabric resources, as can be seen in the illustration below.
So, let's upgrade the Power BI license to a Microsoft Fabric Trial.
For a smoother experience, you should create a new workspace and assign it the Microsoft Fabric Trial license, as can be seen below.
As you can see, while creating a new workspace, you can now assign the Fabric Trial license to it. Upon creation, we are able to take full advantage of Microsoft Fabric.
This video by Guy in a Cube explains the steps for getting the Microsoft Fabric Trial.
Conclusion
Microsoft Fabric is currently in preview but already represents a significant advancement in the field of data and analytics, offering a unified platform that brings together various tools and services. It enables a smooth and collaborative experience for a variety of data professionals, fostering a data-driven culture within organizations. Let's wait for the next steps from Microsoft.