In the modern digital era, the importance of streamlined data preparation cannot be emphasized enough. For data scientists and analysts, a large portion of time is dedicated to data cleansing and preparation, often termed ‘wrangling.’ Microsoft’s introduction of Data Wrangler in its Fabric suite seems like an answer to this age-old challenge. It promises Power Query’s intuitiveness and Python code outputs’ flexibility. Dive in to uncover the magic of this new tool.

Data preparation is a time-consuming and error-prone task. It often involves cleaning, transforming, and merging data from multiple sources. This can be a daunting task, even for experienced data scientists.

What is Data Wrangler?

Data Wrangler is a state-of-the-art tool Microsoft offers in its Fabric suite explicitly designed for data professionals. At its core, it aims to simplify the data preparation process by automating tedious tasks. Much like Power Query, it offers a user-friendly interface, but what sets it apart is that it can generate Python code as an output. As users interact with the GUI, Python code snippets are generated behind the scenes, making integrating various data science workflows easier.

Advantages of Data Wrangler

User-Friendly Interface: Offers an intuitive GUI for those not comfortable with coding.
Python Code Output: Generates Python code in real-time, allowing flexibility and easy integration.
Time-Saving: Reduces the time spent on data preparation dramatically.
Replicability: Since Python code is generated, it ensures replicable data processing steps.
Integration with Fabric Suite: Can be effortlessly integrated with other tools within the Microsoft Fabric suite.
No-code to Low-code Transition: Ideal for those wanting to transition from a no-code environment to a more code-centric one.

How to use Data Wrangler?

You have to click on Data Science inside the Power BI Service.

You have to select the Notebook button.

You have to insert this code above after the upload of the CSV file in the LakeHouse.

Python

import pandas as pd

# Read a CSV into a Pandas DataFrame from e.g. a public blob store
df = pd.read_csv("/lakehouse/default/Files/Top_1000_Companies_Dataset.csv")

You have to click in the Lauch Data Wrangler and then select the data frame “df”.

On this screen, you can do all transformations you need.

In the end this code will be generate.

Python

# Code generated by Data Wrangler for pandas DataFrame

def clean_data(df):
    # Drop columns: 'company_name', 'url' and 6 other columns
    df = df.drop(columns=['company_name', 'url', 'city', 'state', 'country', 'employees', 'linkedin_url', 'founded'])
    # Drop columns: 'GrowjoRanking', 'Previous Ranking' and 10 other columns
    df = df.drop(columns=['GrowjoRanking', 'Previous Ranking', 'job_openings', 'keywords', 'LeadInvestors', 'Accelerator', 'valuation', 'btype', 'total_funding', 'product_url', 'growth_percentage', 'contact_info'])
    # Drop column: 'indeed_url'
    df = df.drop(columns=['indeed_url'])
    # Performed 1 aggregation grouped on column: 'Industry'
    df = df.groupby(['Industry']).agg(estimated_revenues_sum=('estimated_revenues', 'sum')).reset_index()
    # Sort by column: 'estimated_revenues_sum' (descending)
    df = df.sort_values(['estimated_revenues_sum'], ascending=[False])
    return df

df_clean = clean_data(df.copy())
df_clean.head()

After that, you can create or add to a pipeline or schedule a moment to execute this transformation automatically.

Data Wrangler Extension for Visual Studio Code

Data Wrangler is a code-centric data cleaning tool integrated into VS Code and Jupyter Notebooks. Data Wrangler aims to increase the productivity of data scientists doing data cleaning by providing a rich user interface that automatically generates Pandas code and shows insightful column statistics and visualizations.

This document will cover how to:

Install and setup Data Wrangler
Launch Data Wrangler from a notebook
Use Data Wrangler to explore your data
Perform operations on your data
Edit and export code for data wrangling to a notebook
Troubleshooting and providing feedback

Setting up your environment

If you have not already done so, install Python.
IMPORTANT: Data Wrangler only supports Python version 3.8 or higher.
Install Visual Studio Code.
Install the Data Wrangler extension for VS Code from the Visual Studio Marketplace. For additional details on installing extensions, see Extension Marketplace. The Data Wrangler extension is named Data Wrangler, and Microsoft publishes it.

When you launch Data Wrangler for the first time, it will ask you which Python kernel you would like to connect to. It will also check your machine and environment to see if any required Python packages are installed (e.g., Pandas).

Here is a list of the required versions for Python and Python packages, along with whether they are automatically installed by Data Wrangler:

Name Minimum required version Automatically installed
Python 3.8 No
pandas 0.25.2 Yes
regex* 2020.11.13 Yes

* We use the open-source regex package to be able to use Unicode properties (for example, /\p{Lowercase_Letter}/), which aren’t supported by Python’s built-in regex module (re). Unicode properties make it easier and cleaner to support foreign characters in regular expressions.

Name	Minimum required version	Automatically installed
Python	3.8	No
pandas	0.25.2	Yes
regex*	2020.11.13	Yes

If they are not found in your environment, Data Wrangler will attempt to install them for you via pip. If Data Wrangler cannot install dependencies, the easiest workaround is to run the pip install and then relaunch Data Wrangler manually. These dependencies are required for Data Wrangler such that it can generate Python and Pandas code.

Connecting to a Python kernel

There are currently two ways to connect to a Python kernel, as shown in the quick pick below.

1. Connect using a local Python interpreter

If this option is selected, the kernel connection is created using the Jupyter and Python extensions. We recommend this option for a simple setup and a quick way to start with Data Wrangler.

2. Connect using Jupyter URL and token

A kernel connection is created using JupyterLab APIs if this option is selected. Note that this option has performance benefits since it bypasses some initialization and kernel discovery processes. However, it will also require separate Jupyter Notebook server user management. We recommend this option generally in two cases: 1) if there are blocking issues in the first method and 2) for power users who would like to reduce the cold-start time of Data Wrangler.

To set up a Jupyter Notebook server and use it with this option, follow the steps below:

Install Jupyter. We recommend installing the accessible version of Anaconda with Jupyter installed. Alternatively, follow the official instructions to install it.
In the appropriate environment (e.g., in an Anaconda prompt if Anaconda is used), launch the server with the following command (replace the jupyter token with your secure token):
jupyter notebook --no-browser --NotebookApp.token='<your-jupyter-token>'
In Data Wrangler, connect using the address of the spawned server. E.g., http://localhost:8888, and pass in the token used in the previous step. Once configured, this information is cached locally and can automatically be reused for future connections.

Launching Data Wrangler

Once Data Wrangler has been successfully installed, there are 2 ways to launch it in VS Code.

Launching Data Wrangler from a Jupyter Notebook

If you are in a Jupyter Notebook working with Pandas data frames, you’ll now see a “Launch Data Wrangler” button appear after running specific operations on your data frame, such as df.head(). Clicking the button will open a new tab in VS Code with the Data Wrangler interface in a sandboxed environment.

Important note:
We currently only accept the following formats for launching:

df
df.head()
df.tail()

Where df is the name of the data frame variable. The code above should appear at the end of a cell without any comments or other code after it.

Launching Data Wrangler directly from a CSV file

You can also launch Data Wrangler directly from a local CSV file. To do so, open any VS Code folder with the CSV dataset you’d like to explore. In the File Explorer panel, right-click the. CSV dataset and click “Open in Data Wrangler.”

Using Data Wrangler

The Data Wrangler interface is divided into 6 components, described below.

The Quick Insights header lets you quickly see valuable information about each column. Depending on the column’s datatype, Quick Insights will show the distribution of the data, the frequency of data points, and missing and unique values.

The Data Grid gives you a scrollable pane to view your entire dataset. Additionally, when selecting an operation to perform, a preview will be illustrated in the data grid, highlighting the modified columns.

The Operations Panel is where you can search through Data Wrangler’s built-in data operations. The operations are organized by their top-level category.

The Summary Panel shows detailed summary statistics for your dataset or a specific column if one is selected. Depending on the data type, it will show information such as min, max values, datatype of the column, skew, and more.

The Operation History Panel shows a human-readable list of all the operations previously applied in the current Data Wrangling session. It enables users to undo specific operations or edit the most recent operation. Selecting a step will highlight the data grid changes and show the generated code associated with that operation.

The Code Preview section will show the Python and Pandas code that Data Wrangler has generated when an operation is selected. It will remain blank when no operation is selected. The code can even be edited by the user, and the data grid will highlight the effect on the data.

Example: Filtering a column

Let’s go through a simple example using Data Wrangler with the Titanic dataset to filter adult passengers on the ship.

We’ll start by looking at the quick insights of the Age column, and we’ll notice the distribution of the ages and that the minimum age is 0.42. For more information, we can glance at the Summary panel to see that the datatype is a float, along with additional statistics such as the passengers’ mean and median age.

To filter for only adult passengers, we can go to the Operation Panel and search for the keyword “Filter” to find the Filter operation. (You can also expand the “Sort and filter” category to find it.)

Once we select an operation, we are brought into the Operation Preview state, where parameters can be modified to see how they affect the underlying dataset before applying the operation. In this example, we want to filter the dataset only to include adults, so we’ll want to filter the Age column to only include values greater than or equal to 18.

Once the parameters are entered in the operation panel, we can see a preview of what will happen to the data. We’ll notice that the minimum value in age is now 18 in the Quick Insights, along with a visual preview of the rows that are being removed, highlighted in red. Finally, we’ll also notice the Code Preview section automatically shows the code that Data Wrangler produced to execute this Filter operation. We can edit this code by changing the filtered age to 21, and the data grid will automatically update accordingly.

After confirming that the operation has the intended effect, we can click Apply.

Editing and exporting code

Each step of the generated code can be modified. Changes to the data will be highlighted in the grid view as you make changes.

Once you’re done with your data cleaning steps in Data Wrangler, there are 3 ways to export your cleaned dataset from Data Wrangler.

Export code back to Notebook and exit: This creates a new cell in your Jupyter Notebook with all the data cleaning code you generated packaged into a clean Python function.
Export data as CSV: This saves the cleaned dataset as a new CSV file onto your machine.
Copy code to clipboard: This copies all the code generated by Data Wrangler for the data cleaning operations.

Note: If you launched Data Wrangler directly from a CSV, the first export option will be to export the code into a new Jupyter Notebook.

Data Wrangler operations

These are the Data Wrangler operations currently supported in the initial launch of Data Wrangler (with many more to be added soon).

Operation	Description
Sort values	Sort column(s) ascending or descending
Filter	Filter rows based on one or more conditions
Calculate text length	Create new column with values equal to the length of each string value in a text column
One-hot encode	Split categorical data into a new column for each category
Multi-label binarizer	Split categorical data into a new column for each category using a delimiter
Create column from formula	Create a column using a custom Python formula
Change column type	Change the data type of a column
Drop column	Delete one or more columns
Select column	Choose one or more columns to keep and delete the rest
Rename column	Rename one or more columns
Drop missing values	Remove rows with missing values
Drop duplicate rows	Drops all rows that have duplicate values in one or more columns
Fill missing values	Replace cells with missing values with a new value
Find and replace	Split a column into several columns based on a user-defined delimiter
Group by column and aggregate	Group by columns and aggregate results
Strip whitespace	Capitalize the first character of a string with the option to apply to all words.
Split text	Remove whitespace from the beginning and end of the text
Convert text to capital case	Automatically create a column when a pattern is detected from your examples.
Convert text to lowercase	Convert text to lowercase
Convert text to uppercase	Convert text to UPPERCASE
String transform by example	Automatically perform string transformations when a pattern is detected from the examples you provide
DateTime formatting by example	Automatically perform DateTime formatting when a pattern is detected from the examples you provide
New column by example	Automatically create a column when a pattern is detected from the examples you provide.
Scale min/max values	Scale a numerical column between a minimum and maximum value
Custom operation	Automatically create a new column based on examples and the derivation of existing column(s)

Limitations

Data Wrangler currently supports only Pandas DataFrames. Support for Spark DataFrames is in progress.
Data Wrangler’s display works better on large monitors, although different interface portions can be minimized or hidden to accommodate smaller screens.

Conclusion

Data Wrangler in Microsoft Fabric is undeniably a game-changer in data preparation. It combines the best of both worlds by offering the simplicity of Power Query with the robustness and flexibility of Python. As data continues to grow in importance, tools like Data Wrangler that simplify and expedite the data preparation process will be indispensable for organizations aiming to stay ahead.

That’s it for today!

Sources:

https://medium.com/towards-data-engineering/data-wrangler-in-fabric-simplifying-data-prep-with-no-code-ab4fe7429b49

https://radacad.com/fabric-data-wrangler-a-tool-for-data-scientist

https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler

https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler

https://github.com/microsoft/vscode-data-wrangler

Author: Lawrence Teixeira

With over 30 years of expertise in the Technology sector and 18 years in leadership roles as a CTO/CIO, he excels at spearheading the development and implementation of strategic technological initiatives, focusing on system projects, advanced data analysis, Business Intelligence (BI), and Artificial Intelligence (AI). Holding an MBA with a specialization in Strategic Management and AI, along with a degree in Information Systems, he demonstrates an exceptional ability to synchronize cutting-edge technologies with efficient business strategies, fostering innovation and enhancing organizational and operational efficiency. His experience in managing and implementing complex projects is vast, utilizing various methodologies and frameworks such as PMBOK, Agile Methodologies, Waterfall, Scrum, Kanban, DevOps, ITIL, CMMI, and ISO/IEC 27001, to lead data and technology projects. His leadership has consistently resulted in tangible improvements in organizational performance. At the core of his professional philosophy is the exploration of the intersection between data, technology, and business, aiming to unleash innovation and create substantial value by merging advanced data analysis, BI, and AI with a strategic business vision, which he believes is crucial for success and efficiency in any organization. View all posts by Lawrence Teixeira

Data Wrangler in Microsoft Fabric: A New Tool for Accelerating Data Preparation. Experience the Power Query Feel but with Python Code Output

Data Wrangler Extension for Visual Studio Code