MemGPT: Unlimited Memory without Token Constraints for Generative AI Platforms like GPT-4, LaMDA, PaLM, LLaMA, Claude, and others

The field of conversational AI has witnessed a substantial transformation with the emergence of large language models (LLMs) such as GPT-4, LaMDA, PaLM, LLaMA, Claude, and others. These sophisticated models, founded on transformer architectures, have redefined the possibilities of natural language processing, paving the way for a myriad of applications across both consumer and enterprise sectors. However, despite this leap forward, LLMs are still bound by a significant limitation: their context window size. This bottleneck restricts their ability to manage extended dialogues and analyze lengthy documents efficiently. But what if there were a way to circumvent this limitation?

What is MemGPT?

MemGPT, standing for Memory-GPT, is a system devised to enhance the performance of Large Language Models (LLMs) by introducing a more advanced memory management scheme, helping to overcome the challenges posed by fixed context windows. Below are some of the key features of MemGPT:

  1. Memory Management: MemGPT incorporates a tiered memory system into a fixed-context LLM processor, granting it the ability to manage its own memory. By intelligently handling different memory tiers, it extends the context available within the limited context window of the LLM, addressing the issue of constrained context windows common in large language models.
  2. Virtual Context Management: MemGPT introduces a technique known as virtual context management, which decides what information stays inside the LLM's fixed context window and what is paged out to external storage so it can be retrieved later.
  3. Operating System-Inspired: The architecture of MemGPT draws inspiration from traditional operating systems, especially their hierarchical memory systems that facilitate data movement between fast and slow memory. This approach enables effective memory resource management, similar to how operating systems provide the illusion of large memory resources to applications through virtual memory paging.
  4. Interruption Handling: MemGPT employs interrupts to manage the control flow between itself and the user, ensuring smooth interaction and effective memory management during operations.
  5. Extended Conversational Context: Through effective memory management, MemGPT facilitates extended conversational context, allowing for longer and more coherent interactions that surpass the limitations imposed by fixed-length context windows.

In essence, MemGPT represents a significant step forward in the utilization of Large Language Models, creating a pathway for more effective and extended interactions that resemble human discourse by smartly managing memory resources.

For more information you can access the official website here.

How does MemGPT Work?

MemGPT gives LLMs a feedback loop between user events, searching virtual context, and performing a function (source)

Imagine your computer’s OS, which deftly manages applications and data across RAM and disk storage, providing seamless access to resources beyond the physical memory limits. MemGPT mirrors this concept by orchestrating different memory tiers within an LLM. It includes:

  1. Main Context: Analogous to RAM, this is the immediate context the LLM processor works with during inference.
  2. External Context: Similar to a hard drive, this stores information beyond the LLM’s direct reach but can be accessed when needed.
  3. Interrupts: Like an OS interrupt, MemGPT can pause and resume the processor, managing the control flow with the user.

This architecture allows for dynamic context management, enabling the LLM to retrieve relevant historical data akin to how an OS handles page faults.
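
To make the analogy concrete, below is a deliberately simplified sketch of the idea in plain Python. This is not MemGPT’s actual implementation; the class and method names are invented purely for illustration.

Python
# Toy illustration of tiered memory (hypothetical names, not MemGPT's API):
# a small "main context" (RAM) backed by a larger "external context" (disk),
# with old messages paged out and searched again on demand.
from collections import deque


class TieredMemory:
    def __init__(self, main_capacity: int = 3):
        self.main_capacity = main_capacity
        self.main_context: deque[str] = deque()   # analogous to RAM
        self.external_context: list[str] = []     # analogous to disk

    def add(self, message: str) -> None:
        self.main_context.append(message)
        # "Page out" the oldest messages once the main context overflows.
        while len(self.main_context) > self.main_capacity:
            self.external_context.append(self.main_context.popleft())

    def recall(self, query: str) -> list[str]:
        # Analogous to servicing a page fault: pull relevant history
        # back from external storage when the model needs it.
        return [m for m in self.external_context if query.lower() in m.lower()]


memory = TieredMemory(main_capacity=2)
for msg in ["my name is Ada", "I like rowing", "what's the weather?", "remind me tomorrow"]:
    memory.add(msg)
print(memory.recall("name"))  # -> ['my name is Ada']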

What problem does MemGPT solve?

MemGPT addresses several challenges associated with language modeling, particularly enhancing the capabilities of existing large language models (LLMs) like GPT-3. Here are the key problems it resolves:

  1. Long-term Context Retention:
    MemGPT introduces solutions for managing long-term context, a significant hurdle in advancing language modeling. By effectively managing memory, it can retain and access information over extended sequences, which is crucial for understanding and generating coherent responses in conversations or documents with many interactions or long texts.
  2. Enhanced Memory Management:
    It employs a tiered memory system, data transfer functions, and control via interrupts to manage memory efficiently. This setup enhances fixed-context LLMs, allowing them to handle tasks like document analysis and multi-session chat more effectively, overcoming the inherent context limitations in modern LLMs for better performance and user interactions.
  3. Extended Context Window:
    MemGPT effectively extends the context window of LLMs, enabling them to manage different memory tiers intelligently. This extended context is crucial for LLMs to have a more in-depth understanding and generate more coherent and contextually relevant responses over a series of interactions.
  4. Improved Interaction with Chatbots:
    By utilizing a memory hierarchy, MemGPT allows chatbots to access and modify information beyond their limited context window, facilitating more meaningful and prolonged interactions with users. This memory hierarchy enables the chatbot to move data between different layers of memory, ensuring relevant information is readily accessible when needed.

Through these solutions, MemGPT significantly bridges the gap between memory management and generative capacity in language modeling, paving the way for more sophisticated applications in various domains.

Comparing context lengths of commonly used models / APIs (data collected 9/2023).

*Assuming a preprompt of 1k tokens, and an average message size of ∼50 tokens (∼250 characters).
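
Plugging these assumptions into a quick calculation shows how fast a fixed window fills up; for a model with the classic 8K-token window, for example:

Python
# Rough message budget under the assumptions above
# (1k-token preprompt, ~50 tokens per message).
context_window = 8192
preprompt = 1000
tokens_per_message = 50
print((context_window - preprompt) // tokens_per_message)  # ≈ 143 messages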

How to install MemGPT

PowerShell
pip install pymemgpt

Add your OpenAI API key to your environment:

PowerShell
export OPENAI_API_KEY=YOUR_API_KEY # on Linux/Mac
set OPENAI_API_KEY=YOUR_API_KEY # on Windows
$Env:OPENAI_API_KEY = "YOUR_API_KEY" # on Windows (PowerShell)

Configure the default settings for MemGPT by running:

PowerShell
memgpt configure

Now, you can run MemGPT with:

PowerShell
memgpt run

The run command supports the following optional flags (if set, they override the config defaults); an example invocation follows the list:

  • --agent: (str) Name of agent to create or to resume chatting with.
  • --human: (str) Name of the human to run the agent with.
  • --persona: (str) Name of agent persona to use.
  • --model: (str) LLM model to run [gpt-4, gpt-3.5].
  • --preset: (str) MemGPT preset to run agent with.
  • --first: (str) Allow the user to send the first message.
  • --debug: (bool) Show debug logs (default=False)
  • --no-verify: (bool) Bypass message verification (default=False)
  • --yes/-y: (bool) Skip confirmation prompt and use defaults (default=False)
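
For example, a run that overrides a few of these defaults might look like the following (the agent and persona names are placeholders, not values shipped with MemGPT):

PowerShell
memgpt run --agent my_agent --persona my_persona --model gpt-4 --debug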

You can run the following commands in the MemGPT CLI prompt:

  • /exit: Exit the CLI
  • /attach: Attach a loaded data source to the agent
  • /save: Save a checkpoint of the current agent/conversation state
  • /dump: View the current message log (see the contents of main context)
  • /memory: Print the current contents of agent memory
  • /pop: Undo the last message in the conversation
  • /heartbeat: Send a heartbeat system message to the agent
  • /memorywarning: Send a memory warning system message to the agent

You can find more information on the official GitHub repository.

MemGPT for OpenAI Setup

Matthew Berman has produced a great review of the original MemGPT research paper and the initial setup for OpenAI API users.

Note that in the video tutorial, Matthew sets things up with a Conda environment, but this isn't strictly necessary; it can also be done with a standard .venv environment.

MemGPT and Open Source Models Setup

In this video, Matthew Berman covers a quick setup for using MemGPT with open-source models like LLaMA, Airoboros, and Mistral via Runpod. Although this may sound complicated, it isn't too difficult, and it offers great potential cost savings versus using OpenAI.

Note open-source model support is still in early-stage development.

MemGPT and Autogen Setup

AutoGen is a tool that helps create LLM applications in which multiple agents can talk to each other to complete tasks, such as brainstorming a business proposal. These AutoGen agents can be tailored, they can chat, and they easily let humans join in the conversation. In this tutorial, Matthew Berman explains how to expand the memory of these AI agents by combining AutoGen with MemGPT.

AutoGEN and MemGPT and Local LLM Complete Tutorial

Created by Prompt Engineer, this 30-minute video covers in great detail all the steps required to get this combination of solutions live with Runpod. As Prompt Engineer explains, the tutorial took quite a long time to produce, as it required a number of test-and-learn steps. So far, this is one of the most comprehensive tutorials available.

Summary: 00:11 🚀 The video demonstrates how to connect MemGPT, AutoGEN, and local Large Language Models (LLMs) using Runpods.

01:32 🤖 You can integrate MemGPT and AutoGEN to work together, with MemGPT serving as an assistant agent alongside local LLMs.

03:46 📚 To get started, install Python, VS Code, and create a Runpods account with credits. You can use Runpods for running local LLMs.

06:43 🛠️ Set up a virtual environment, create a Python file, and activate the environment for your project.

08:52 📦 Install necessary libraries like OpenAI, PyAutoGen, and PyMemGPT to work with AutoGen and MemGPT.

16:21 ⚙️ Use Runpods to deploy local LLMs, select the hardware configuration, and create API endpoints for integration with AutoGEN and MemGPT.

20:29 🔄 Modify the code to switch between using AutoGEN and MemGPT agents based on a flag, allowing you to harness the power of both.

23:31 🤝 Connect AutoGEN and MemGPT by configuring the API endpoints with the local LLMs from Runpods, enabling them to work seamlessly together.

Follow the example Python code:

requirements.txt

Text
pyautogen
pymemgpt

app.py

Python
## pip install pyautogen pymemgpt

import autogen
import openai

# MemGPT modules used to wrap a MemGPT agent so AutoGen can talk to it
import memgpt.autogen.memgpt_agent as memgpt_autogen
import memgpt.autogen.interface as autogen_interface
import memgpt.presets as presets
from memgpt.persistence_manager import InMemoryStateManager

config_list = [
    {
        "api_type": "open_ai",
        "api_base": "https://ekisktiz8hegao-5001.proxy.runpod.net/v1",
        "api_key": "NULL",
    },
]

llm_config = {"config_list": config_list, "seed": 42}

# If USE_MEMGPT is False, then this example will be the same as the official AutoGen repo
# (https://github.com/microsoft/autogen/blob/main/notebook/agentchat_groupchat.ipynb)
# If USE_MEMGPT is True, then we swap out the "coder" agent with a MemGPT agent

USE_MEMGPT = True

## API settings for MemGPT (pointing at the same Runpod endpoint)
openai.api_base = "https://ekisktiz8hegao-5001.proxy.runpod.net/v1"
openai.api_key = "NULL"


# The user agent
user_proxy = autogen.UserProxyAgent(
    name="User_proxy",
    system_message="A human admin.",
    code_execution_config={"last_n_messages": 2, "work_dir": "groupchat"},
    human_input_mode="TERMINATE",  # needed?
    default_auto_reply="You are going to figure it all out on your own. "
    "Work by yourself; the user won't reply until you output `TERMINATE` to end the conversation.",
)


interface = autogen_interface.AutoGenInterface()
persistence_manager = InMemoryStateManager()
persona = "I am a 10x engineer, trained in Python. I was the first engineer at Uber."
human = "I'm a team manager at this company."
memgpt_agent = presets.use_preset(
    presets.DEFAULT_PRESET,
    model="gpt-4",
    persona=persona,
    human=human,
    interface=interface,
    persistence_manager=persistence_manager,
    agent_config=llm_config,
)


if not USE_MEMGPT:
    # In the AutoGen example, we create an AssistantAgent to play the role of the coder
    coder = autogen.AssistantAgent(
        name="Coder",
        llm_config=llm_config,
        system_message="I am a 10x engineer, trained in Python. I was the first engineer at Uber.",
        human_input_mode="TERMINATE",
    )

else:
    # In our example, we swap this AutoGen agent with a MemGPT agent
    # This MemGPT agent will have all the benefits of MemGPT, ie persistent memory, etc.
    print("\nMemGPT Agent at work\n")
    coder = memgpt_autogen.MemGPTAgent(
        name="MemGPT_coder",
        agent=memgpt_agent,
    )


# Begin the group chat with a message from the user
user_proxy.initiate_chat(
    coder,
    message="Write a Function to print Numbers 1 to 10"
    )
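
Assuming the script above is saved as app.py next to requirements.txt, it can be started from an activated virtual environment like this:

PowerShell
pip install -r requirements.txt
python app.py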

Interview with MemGPT Co-Creator Charles Packer

For more information on the creators of MemGPT, consider watching this video interview with one of its co-creators, UC Berkeley PhD student Charles Packer.

MemGPT as an Operating System

MemGPT draws inspiration from the virtual memory concept in operating systems and is innovatively applied to large language models to create an expansive context space. This innovation shines in scenarios like continuous conversations where traditional limitations on context length pose a challenge. By enabling large language models to handle their memory, MemGPT circumvents the usual restrictions set by fixed context lengths.

Limitations of MemGPT

First, it’s essential to be aware that MemGPT is an emerging project that is still under active development. The team has established a Discord server to foster idea-sharing and enable direct interaction with the creators; you are welcome to join at https://discord.gg/9GEQrxmVyE

Data Sensitivity: MemGPT’s reliance on previous interactions for context can raise concerns regarding data privacy and sensitivity, especially in scenarios involving personal or confidential information.

Contextual Misinterpretations: While adept at handling extended conversations, MemGPT can occasionally misinterpret context, especially in nuanced or emotionally charged communications, leading to responses that may seem out of touch.

Resource Intensity: The system demands significant computational resources for optimal functionality, particularly for processing large volumes of data or maintaining extensive conversation histories.

Dependency on Quality Training Data: MemGPT’s effectiveness is closely tied to the quality of training data. Biased, inaccurate, or incomplete data can hinder the learning process, affecting the quality of interactions.

Adaptation to Diverse Discourses: The system’s ability to adapt to varying communication styles or understand different dialects and cultural nuances is still a work in progress, occasionally affecting its versatility in global or multicultural scenarios.

MemGPT vs Sparse Priming Representations (SPR)

MemGPT:

  • Inspiration: Takes cues from hierarchical memory systems used in traditional operating systems.
  • Functionality: Implements a tiered memory system that allows an LLM to extend its context window by managing which information is stored or retrieved, and when this should happen.
  • Structure: Comprises a Main Context (analogous to an OS’s main memory) and an External Context (similar to secondary storage).
  • Utility: Aims to revolutionize LLMs’ capabilities in tasks that involve unbounded context, such as long-form conversations and detailed document analysis.

Sparse Priming Representations (SPR):

  • Inspiration: Modeled after human memory organization and retrieval systems, focusing on critical information.
  • Functionality: Enhances memory system efficiency by creating concise primers that represent complex ideas, supporting accuracy in understanding and recall (see the sketch after this list).
  • Approach: Prioritizes intuitive and user-friendly memory management, akin to how humans naturally process and store information.
  • Utility: Focused on making LLMs more efficient in knowledge retrieval and learning, improving user engagement and communication tools.
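
As a loose illustration of the SPR idea (not an official implementation), a primer is simply a short, high-signal distillation of a larger body of knowledge that is prepended to a prompt in place of the full text:

Python
# Hypothetical SPR-style primer: a few dense cues standing in for a much
# longer explanation, to be "unfolded" by the LLM when it answers.
primer = (
    "MemGPT: LLM + OS-style virtual memory. "
    "Main context = RAM; external context = disk; interrupts drive control flow. "
    "Pages relevant history in on demand -> effectively unbounded context."
)

prompt = f"{primer}\n\nUsing only the notes above, explain MemGPT to a new engineer."
print(prompt)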

Technical Implementation:

MemGPT:

  • Utilizes a structured approach for memory tier management, allowing for effective data movement and context management.
  • Tailored for scalability in dealing with large datasets and complex, extended tasks.

SPR:

  • Uses a method of creating primers that act as a distillation of complex information, allowing for a more intuitive memory management experience.
  • Geared towards mimicking human cognitive processes for better learning and communication outcomes.

Applications and Implications:

MemGPT:

  • May greatly benefit applications that require processing of large amounts of data over extended periods, like in-depth analysis and ongoing interactions.

SPR:

  • Could significantly enhance tools for learning and communication by providing users with easy-to-understand summaries or primers of complex topics.

Community and Engagement:

MemGPT:

  • Offers an open-source platform for developers and researchers to contribute to and enhance the capabilities of the memory management system.

SPR:

  • Encourages community involvement through contributions of new examples, research, and tools to improve the system’s efficiency and intuitiveness.

In conclusion, both MemGPT and SPR are innovative responses to the challenges of memory management in LLMs, each with its own philosophy and methodology. MemGPT is more structural and system-oriented, potentially better for tasks that need management of extensive contexts. SPR is more user-centric and intuitive, possibly better for learning and communication by simplifying complex information.

While both aim to enhance LLMs’ handling of context, their underlying philosophies and expected applications differ, reflecting the diversity of approaches in advancing AI and ML capabilities. The ongoing developments and community contributions in both these areas show a vibrant and collaborative effort to push the boundaries of what’s possible with memory management in LLMs.

Conclusion

MemGPT stands as a testament to the power of innovation in AI, bridging the gap between what LLMs can do and what we aspire for them to achieve. As we march towards the future, the vision of LLMs as comprehensive operating systems doesn’t seem far-fetched—it’s nearly within our grasp, and MemGPT is leading the charge. What do you think?

That’s it for today!

Sources

cpacker/MemGPT: Teaching LLMs memory management for unbounded context 📚🦙 (github.com)

MemGPT: Overcoming Context Limitations for ChatGPT and Other LLMs for Document Chats & More (superthread.com)

MemGPT

MemGPT: Towards LLMs as Operating Systems, 2310.08560.pdf (arxiv.org)

What is MemGPT AI and MemGPT Installation Tutorial 2023 (dragganaitool.com)

Haly AI

Beyond Automation: Delving Deep into Microsoft’s AutoGen Conversational AI Framework

In the heart of innovation, Microsoft has crafted a gem known as AutoGen, a framework designed to foster the creation of applications through Large Language Models (LLMs). Unveiling a world where multi-agent conversations drive solutions, AutoGen is not just a tool but a revolutionary stride in AI technology.

Moreover, the realm of Large Language Models (LLMs) has been a buzzing hive of potential waiting to be harnessed. With AutoGen, the wait is over as it paves the way for seamless interactions among AI agents, humans, and tools, crafting a narrative of endless possibilities.

The Core Essence of AutoGen

At its core, AutoGen is an enabler, a catalyst that simplifies the intricacies of developing LLM-based applications. Its philosophy is rooted in collaborative problem-solving, where multiple agents can converse and solve tasks collectively.

Additionally, AutoGen goes beyond mere automation. It embodies optimization, ensuring that the workflow of applications is automated and optimized for peak performance. This is where AutoGen shines, revolutionizing the LLM application framework.

What capabilities does AutoGen offer?

The brilliance of AutoGen is seen in its ability to seamlessly blend the power of LLMs, human insights, and other tools, thereby simplifying the orchestration and optimization of complex workflows inherent in LLM applications. AutoGen facilitates efficient problem-solving through customizable conversational agents and paves the way for innovative applications across various domains.

  1. Multi-Agent Conversations:
  • You can create multi-agent systems where agents with specialized capabilities converse to solve tasks collaboratively. These conversations can occur between AI agents, between humans and AI, or in a mix of both, expanding the possibilities.
  2. LLM Workflow Automation and Optimization:
  • AutoGen simplifies the automation and optimization of intricate LLM workflows, which is especially beneficial as LLM-based applications become increasingly complex. This alleviates the challenges of orchestrating optimal workflows with robust performance.
  3. Customizable Conversational Agents:
  • Design and customize agents to your needs, whether based on LLMs, other tools, or even human inputs. This customization facilitates more effective solutions tailored to the unique requirements of your projects.
  4. Human-AI Collaboration:
  • AutoGen facilitates seamless integration between human input and AI capabilities, allowing for collaborative problem-solving. This is particularly useful in scenarios where the strengths of both humans and AI can be leveraged for better outcomes.
  5. Development of Advanced Applications:
  • Use AutoGen to develop advanced applications such as code-based question-answering systems, supply-chain optimization, and other scenarios where automated and optimized multi-agent conversations can significantly reduce manual interactions.
  6. Enhanced LLM Capabilities:
  • Extend the capabilities of advanced LLMs like GPT-4 by addressing their limitations through integration with other tools and human input, making them more robust and capable of handling multi-faceted tasks.
  7. Learning and Experimentation:
  • Being an open-source framework, AutoGen provides a playground for developers, researchers, and enthusiasts to learn, experiment, and contribute to the growing knowledge in AI and LLMs.
  8. Research and Innovation:
  • AutoGen can serve as a solid foundation for research and innovation in AI, especially in exploring the dynamics of multi-agent systems and human-AI collaboration.
  9. Community Contributions:
  • Being open-source, AutoGen encourages community contributions, which can lead to the development of new features, capabilities, and improvements in the framework, fostering a collaborative environment for advancing the state of AI.

AutoGen, with its ability to meld the prowess of LLMs, humans, and other tools through conversational agents, opens up a vast spectrum of opportunities for developers and organizations alike to harness the potential of AI in novel and impactful ways.

Agent concepts behind AutoGen

AutoGen abstracts and implements conversable agents designed to solve tasks through inter-agent conversations. Specifically, the agents in AutoGen have the following notable features:

  • Conversable: Agents in AutoGen are conversable, which means that any agent can send and receive messages from other agents to initiate or continue a conversation.
  • Customizable: Agents in AutoGen can be customized to integrate LLMs, humans, tools, or a combination of them.

The figure below shows the built-in agents in AutoGen.

Source: Multi-agent Conversation Framework | AutoGen (microsoft.github.io)

The agents ConversableAgent, AssistantAgent, UserProxyAgent, and GroupChatManager are classes provided within the AutoGen framework, a system by Microsoft for facilitating multi-agent conversations in large language models (LLMs). Here’s a detailed breakdown of these agents:

  1. ConversableAgent:
  • A generic class designed for agents capable of conversing with each other through message exchange to complete a task.
  • Agents can communicate with other agents and perform actions, with their efforts potentially differing based on the messages they receive.
  • Provides an auto-reply capability for more autonomous multi-agent communication while retaining the option for human intervention.
  • Extensible by registering reply functions with the register_reply() method.
  2. AssistantAgent:
  • Acts as an AI assistant using LLMs by default, without requiring human input or code execution.
  • Can write Python code for a user to execute when a task description message is received, with the code generated by an LLM like GPT-4.
  • Receives execution results and suggests corrections or bug fixes if necessary.
  • Its behavior can be altered by passing a new system message, and LLM inference configuration can be managed via llm_config.
  3. UserProxyAgent:
  • Serves as a proxy agent for humans, soliciting human input for the agent’s replies at each interaction turn by default while also having the ability to execute code and call functions.
  • Triggers code execution automatically upon detecting an executable code block in the received message when no human user input is provided.
  • Code execution can be disabled, and LLM-based responses, which are disabled by default, can be enabled via llm_config. When llm_config is set as a dictionary, the UserProxyAgent can generate replies using an LLM when code execution is not performed.
  4. GroupChatManager:
  • A class inherited from ConversableAgent, designed to manage a group chat involving multiple agents.
  • Provides a method run_chat to initiate and manage a group chat, with parameters for messages, sender, and configuration.
  • This class appears to be in preview, indicating it might be a newer or less stable feature of AutoGen.

In practical terms, these agents facilitate complex workflows and interaction patterns among multiple entities, be they other AI agents, human users, or a combination of both. For example, the GroupChatManager could potentially moderate conversations between agents and humans, passing messages according to specific rules.
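
As a minimal sketch of how these classes fit together (assuming pyautogen is installed and OPENAI_API_KEY is set; the model name, working directory, and task below are illustrative):

Python
import os

import autogen

# One LLM backend for the assistant; the key is read from the environment.
config_list = [{"model": "gpt-4", "api_key": os.environ["OPENAI_API_KEY"]}]

# AssistantAgent: LLM-backed agent that writes code and suggests fixes.
assistant = autogen.AssistantAgent(
    name="assistant",
    llm_config={"config_list": config_list},
)

# UserProxyAgent: stands in for the human and executes any code it receives.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",                      # fully automated for this sketch
    max_consecutive_auto_reply=5,
    code_execution_config={"work_dir": "coding"},
)

# The user proxy sends the task; the two agents converse until it is done.
user_proxy.initiate_chat(
    assistant,
    message="Plot the first 20 Fibonacci numbers and save the chart to fib.png.",
)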

Examples of Various Applications Executed with AutoGen

The figure below shows six examples of applications built using AutoGen.


A Rising Competitive Arena

The domain of Large Language Model (LLM) application frameworks is swiftly evolving, with Microsoft’s AutoGen contending robustly amidst many competitors. LangChain is a framework for constructing a diverse range of LLM applications, spanning chatbots, text summarizers, and agents. At the same time, LlamaIndex provides abundant tools for interfacing LLMs with external data reservoirs like documents and databases.

Similarly, libraries such as AutoGPT, MetaGPT, and BabyAGI operate in the LLM-agent and multi-agent application space. ChatDev employs LLM agents to mimic a full-fledged software development team. Concurrently, Hugging Face’s Transformers Agents library empowers developers to craft conversational applications that bridge LLMs with external tools.

The arena of LLM agents is a burgeoning focal point in research and development, with early-stage models already devised for a spectrum of tasks, including product development, executive functions, shopping, and market analysis. Research has demonstrated the potential of LLM agents to simulate mass population behaviors or to generate realistic non-playable characters in gaming environments. Yet a substantial portion of this work remains at the proof-of-concept stage, not quite ready for full-fledged production due to hurdles like hallucinations and the erratic behavior exhibited by LLM agents.

Nonetheless, the outlook for LLM applications is promising, with agents poised to assume a pivotal role. Major tech entities are placing substantial bets on AI copilots becoming integral components of future applications and operating systems. LLM agent frameworks will allow companies to design customized AI copilots. The foray of Microsoft into this burgeoning arena with AutoGen underscores the escalating competition surrounding LLM agents and their prospective future impact.

Bridging the Gap: Human and AI Interaction

One of AutoGen’s hallmark features is its seamless integration of human input within the AI conversation. This blend of human and AI interaction is innovative and a game-changer in resolving complex tasks.

Moreover, this integration goes a long way in addressing the limitations of LLMs, making AutoGen a torchbearer in promoting harmonious human-AI collaborations.

Conclusion

AutoGen is more than just a tool; it’s a promise of the future. With its relentless innovation, Microsoft has given the world a framework that simplifies the development of LLM applications and pushes the boundaries of what’s achievable.

Moreover, as we delve deeper into the realm of AI, frameworks like AutoGen are set to play a pivotal role in shaping the narrative of AI, presenting a future where the sky is not the limit but just the beginning.

That’s it for today!

Sources

AutoGen: Enabling next-generation large language model applications – Microsoft Research

microsoft/autogen: Enable Next-Gen Large Language Model Applications. Join our Discord: https://discord.gg/pAbnFJrkgZ (github.com)

Microsoft’s AutoGen has multiple AI agents talk to do your work | VentureBeat

Open Interpreter’s ChatGPT-Like Interface: The Future of Natural Language Code Execution

There’s an ever-increasing need for practical, adaptable, and user-friendly tools in the dynamic world of software development and programming. Developers are looking for solutions that allow them to operate effortlessly in their local development setups without being hampered by the confines of online hosted platforms. This article delves into a revolutionary answer to these demands – Open Interpreter.

While the Code Interpreter from OpenAI, based on the GPT-4 architecture, is undeniably transformative, it has limitations. Being a hosted solution, OpenAI’s offering runs in a distant, regulated environment, restricting developers from harnessing the full capabilities of their local configurations. Among the challenges faced are no access to the internet, a confined list of pre-loaded packages, a cap on uploads at 100 MB, and a runtime restriction of 120 seconds. Moreover, if the session terminates, any produced files or links vanish, creating obstacles for extended assignments.

With Open Interpreter, developers can fully utilize their computer’s resources, side-stepping the limitations of online services.

Understanding Open Interpreter

At its core, Open Interpreter is a bridge between human language and computer code. Instead of manually writing out every line of code, users can describe their intentions in natural language, and Open Interpreter translates that into executable code. This offers an entirely new programming approach, simplifying and streamlining the process.

Features of Open Interpreter

  1. Code Interpretation from Natural Language: The primary selling point is that this feature allows users to describe a task in English, and the tool will convert it into executable code. It significantly reduces barriers to programming for non-developers.
  2. ChatGPT-Like Interface: With a user-friendly, chatbot-like interface, users can have interactive programming sessions, making the coding experience more intuitive.
  3. Data Handling Capabilities: Beyond simple code interpretation, Open Interpreter is a powerhouse for data-related tasks. It provides a comprehensive toolset for data professionals, from data visualization to in-depth analysis.
  4. Open-Source Nature: Open Interpreter is a gift to the developer community. As an open-source project, developers can contribute to its growth, refining its features and expanding its capabilities.
  5. Local Execution: In an age where data privacy is paramount, Open Interpreter allows for local code execution, ensuring that sensitive data remains on the user’s machine.

What can I do with Open Interpreter?

  1. Educational Purposes: This can revolutionize teaching programming and other computer-related tasks. Students can communicate with their machines more intuitively, eliminating the steep learning curve associated with traditional programming.
  2. Data Science & Analysis: As it can plot, clean, and analyze large datasets, researchers and data scientists can streamline their workflow using natural language to manipulate and visualize their data.
  3. Automation: For individuals who aren’t traditionally programmers, Open Interpreter might offer a more accessible avenue for automating tasks on their computers, from simple file manipulations to more complex functions like web scraping.
  4. Content Creation: The ability to create and edit photos, videos, and PDFs using natural language commands could revolutionize how content creators work. Imagine telling your computer how to edit a video or photo without manually going through the process.
  5. Integration with Other Tools: Because it works with Code Llama or with GPT-4 via an OpenAI API key, Open Interpreter can be integrated with other tools, increasing its utility.
  6. Accessibility: For people with disabilities, communicating with a computer using natural language could make computing more accessible, especially if paired with voice recognition software.
  7. Security: As with any powerful tool, there are potential risks. Running code or commands without understanding the implications can lead to unintended consequences. Users should be educated about the risks, especially when executing commands that can modify their system or data.
  8. Open Source Community: The rapid growth in stars on GitHub indicates a strong community interest. This suggests that developers worldwide might contribute to its development, making the tool even more robust and versatile.

How do you install and use the Open Interpreter?

At this moment, you have to install it inside a Python environment.

  1. First, you have to install Python; in my case, I have Python 3.11.2 installed.
  2. This step isn't strictly necessary, but I recommend installing Visual Studio Code. Press Ctrl+Shift+P and create a Python virtual environment.
  3. I created a "requirements.txt" file with the ChatGPT Code Interpreter libraries (shown below).
Text
pandas
numpy
scipy
xarray
#Machine Learning:
scikit-learn
xgboost
keras
torch
#Natural Language Processing:
nltk
spacy
textblob
gensim
#Data Visualization:
matplotlib
seaborn
plotly
bokeh
#Web Scraping and Network:
requests
urllib3
aiohttp
beautifulsoup4
#Deep Learning:
keras
torch
theano
#Image Processing:
pillow
imageio
opencv-python
scikit-image
#Audio Processing:
librosa
pyaudio
soundfile
#File Format:
openpyxl
xlrd
pyPDF2
python-docx
#Database:
sqlalchemy
psycopg2
mysql-connector-python
#Web Development:
flask
django
tornado
quart
#Others:
pytest
joblib
pytz
pyyaml

python-docx
pdfkit
wkhtmltopdf
pandoc 
pdfplumber  
reportlab
open-interpreter

Type pip install -r requirements.txt to install the most used Python libraries.

PowerShell
pip install -r requirements.txt

After that, install the Open Interpreter.

PowerShell
pip install open-interpreter

After installation, run interpreter:

PowerShell
interpreter -y

Now, you are free to use and test Open Interpreter. If you want more detailed information, go to the official GitHub page.

Insert the prompt:

create a csv file with a year column from 2000 to 2023 and a total sales amount column with random numbers

Note that a file called sales_data.csv was automatically created in the folder.

Now type this prompt: create a line visual to show this sales_data.csv and plot on the screen.

You can do whatever you want. You can also write code that uses Open Interpreter, like the example below.

Python
import interpreter

interpreter.chat("Plot AAPL and META's normalized stock prices") # Executes a single command
interpreter.chat() # Starts an interactive chat

Type py teste.py to execute this code.

PowerShell
PS D:\interpreter> & d:/interpreter/.venv/Scripts/Activate.ps1
(.venv) PS D:\interpreter> py teste.py

▌ Model set to GPT-4
Tip: To run locally, use interpreter --local

Open Interpreter will require approval before running code. Use interpreter -y to bypass this.

Press CTRL-C to exit.

  To plot AAPL and META's normalized stock prices, we will need to follow these steps:

   1 Install necessary packages. We will need pandas, yfinance, and matplotlib for Python.
   2 Download the stock prices for AAPL and META using yfinance.
   3 Normalize the stock prices.
   4 Plot the normalized stock prices.

  Let's start with the first step, which is installing the necessary packages.

  import pkg_resources                                                                                                                                                          
                                                                                                                                                                                
  REQUIRED_PACKAGES = ['pandas', 'yfinance', 'matplotlib']                                                                                                                      
                                                                                                                                                                                
  for package in REQUIRED_PACKAGES:                                                                                                                                             
      try:                                                                                                                                                                      
          dist = pkg_resources.get_distribution(package)                                                                                                                        
          print('{} ({}) is installed'.format(dist.key, dist.version))                                                                                                          
      except pkg_resources.DistributionNotFound:                                                                                                                                
          print('{} is NOT installed'.format(package))                                                                                                                          

  pandas (2.1.0) is installed
  yfinance (0.2.28) is installed
  matplotlib (3.8.0) is installed

  Great! All the necessary packages are already installed.

  Now, let's move on to the next step: downloading the stock prices for AAPL (Apple Inc.) and META (Meta Platforms Inc., formerly Facebook Inc.) using the yfinance package.    
  We will download the last 5 years of daily closing prices.

  import yfinance as yf                                                                                                                                                         
                                                                                                                                                                                
  # Define the ticker symbols                                                                                                                                                   
  symbols = ['AAPL', 'META']                                                                                                                                                    
                                                                                                                                                                                
  # Download the stock data                                                                                                                                                     
  stock_data = yf.download(symbols, start='2017-01-01', end='2022-12-31')['Close']                                                                                              
                                                                                                                                                                                
  # Print the first few rows of the data                                                                                                                                        
  stock_data.head()                                                                                                                                                             

  [                       0%%                      ]
  [*********************100%%**********************]  2 of 2 completed
  AAPL        META
  Date
  2017-01-03  29.037500  116.860001
  2017-01-04  29.004999  118.690002
  2017-01-05  29.152500  120.669998
  2017-01-06  29.477501  123.410004
  2017-01-09  29.747499  124.900002

  We have successfully downloaded the stock prices for AAPL and META.

  The next step is to normalize these stock prices. Normalizing the stock prices allows us to compare the two stocks on the same scale. We will normalize the prices such that  
  the first price is 1 and all subsequent prices are relative to the first price.

  normalized_data = stock_data / stock_data.iloc[0]                                                                                                                             
                                                                                                                                                                                
  # Print the first few rows of the normalized data                                                                                                                             
  normalized_data.head()                                                                                                                                                        

  AAPL      META
  Date
  2017-01-03  1.000000  1.000000
  2017-01-04  0.998881  1.015660
  2017-01-05  1.003960  1.032603
  2017-01-06  1.015153  1.056050


  The stock prices for AAPL and META have been successfully normalized.

  Now, let's move on to the final step: plotting the normalized stock prices. We will use the matplotlib package for this.

  import matplotlib.pyplot as plt                                                                                                                                               
                                                                                                                                                                                
  # Create a new figure                                                                                                                                                         
  plt.figure(figsize=(14, 7))                                                                                                                                                   
                                                                                                                                                                                
  # Plot the normalized stock prices                                                                                                                                            
  for symbol in symbols:                                                                                                                                                        
      plt.plot(normalized_data.index, normalized_data[symbol], label=symbol)                                                                                                    
                                                                                                                                                                                
  # Add a title and labels                                                                                                                                                      
  plt.title('Normalized Stock Prices: AAPL vs META')                                                                                                                            
  plt.xlabel('Date')                                                                                                                                                            
  plt.ylabel('Normalized Price')                                                                                                                                                
  plt.legend()                                                                                                                                                                  
                                                                                                                                                                                
  # Save the plot as an image                                                                                                                                                   
  plt.savefig('normalized_stock_prices.png')                                                                                                                                    
                                                                                                                                                                                
  # Close the plot                                                                                                                                                              
  plt.close()                                                                                                                                                                   

  <Figure size 1400x700 with 0 Axes>
  [<matplotlib.lines.Line2D object at 0x0000024DE75D0B90>]
  [<matplotlib.lines.Line2D object at 0x0000024DE7604990>]
  Text(0.5, 1.0, 'Normalized Stock Prices: AAPL vs META')
  Text(0.5, 0, 'Date')
  Text(0, 0.5, 'Normalized Price')
  <matplotlib.legend.Legend object at 0x0000024DE75DAB50>

Let’s try asking for a search on the internet: What are the last 10 BBC news headlines?

PowerShell
(.venv) PS D:\interpreter> interpreter -y
> What are the last 10 BBC news headlines?

  To get the last 10 BBC news headlines, we can use web scraping techniques. Python has several libraries for this purpose, such as BeautifulSoup and requests. Here is the     
  plan:

   1 Install necessary Python packages (requests and BeautifulSoup) if they are not already installed.
   2 Fetch the BBC News homepage HTML using requests.
   3 Parse the HTML using BeautifulSoup to find the news headlines.
   4 Extract and return the last 10 news headlines.

  Let's start with the first step, which is to check if the necessary Python packages are installed. If not, we will install them.



  import pkg_resources                                                                                                                                                          
                                                                                                                                                                                
  REQUIRED_PACKAGES = ['beautifulsoup4', 'requests']                                                                                                                            
                                                                                                                                                                                
  for package in REQUIRED_PACKAGES:                                                                                                                                             
      try:                                                                                                                                                                      
          dist = pkg_resources.get_distribution(package)                                                                                                                        
          print('{} ({}) is installed'.format(dist.key, dist.version))                                                                                                          
      except pkg_resources.DistributionNotFound:                                                                                                                                
          print('{} is NOT installed'.format(package))                                                                                                                          


  beautifulsoup4 (4.12.2) is installed
  requests (2.31.0) is installed


  Great! Both 'beautifulsoup4' and 'requests' packages are already installed.

  Now, let's move to the next step of the plan: Fetch the BBC News homepage HTML using requests. We will send a GET request to the BBC News homepage URL and print the status   
  of the response to ensure we are able to connect to the page.

  import requests                                                                                                                                                               
                                                                                                                                                                                
  url = 'https://www.bbc.co.uk/news'                                                                                                                                            
  response = requests.get(url)                                                                                                                                                  
                                                                                                                                                                                
  print('Status code:', response.status_code)                                                                                                                                   

  Status code: 200


  The status code 200 indicates that we successfully fetched the HTML of the BBC News homepage.

  Now, let's move to the next step of the plan: Parse the HTML using BeautifulSoup to find the news headlines. We will parse the HTML and find the elements that contain the    
  news headlines. For this, we need to know the structure of the BBC News homepage. The headlines are usually contained in HTML elements with specific class names. We will     
  try to find these elements and print some of them to verify that we are correctly extracting the headlines.

  from bs4 import BeautifulSoup                                                                                                                                                 
                                                                                                                                                                                
  soup = BeautifulSoup(response.text, 'html.parser')                                                                                                                            
                                                                                                                                                                                
  # BBC News headlines are usually in h3 elements with class 'gs-c-promo-heading__title'
  headlines = soup.find_all('h3', class_='gs-c-promo-heading__title')

  # Print the first 5 headlines to verify
  for headline in headlines[:5]:
      print(headline.text.strip())

  Russell Brand accused of rape and sexual assault
  Russell Brand accused of rape and sexual assault
  A barren wasteland with a lingering smell of death
  Diamonds and diplomacy: How a US envoy fell from grace
  'Everything has just been completely ripped away'


  We have successfully extracted the news headlines from the BBC News homepage. The headlines printed above are a sample of the headlines we found.

  Now, let's move to the final step of the plan: Extract and return the last 10 news headlines. We will extract the text from the last 10 headline elements and return them.    

  # Get the last 10 headlines                                                                                                                                                   
  last_10_headlines = [headline.text.strip() for headline in headlines[-10:]]                                                                                                   
                                                                                                                                                                                
  last_10_headlines                                                                                                                                                             


  ['How old is too old to lead?', 'The companies staying fully remote', "A 'beach towel revolt' sweeping Greece", 'Sexton breaks record as eight-try Ireland beat Tonga',       
  'Wales struggle past Portugal at World Cup', 'Brighton beat Man Utd as fans boo Ten Hag', 'Barcelona crush Betis to go top', 'Sainz on Singapore pole as angry Verstappen     
  struggles', "'We have a good vibe' - Ten Hag avoids Sancho questioning", 'Newcastle beat Brentford to ease pressure on Howe']

  Here are the last 10 BBC news headlines:

    1 'How old is too old to lead?'
    2 'The companies staying fully remote'
    3 "A 'beach towel revolt' sweeping Greece"
    4 'Sexton breaks record as eight-try Ireland beat Tonga'
    5 'Wales struggle past Portugal at World Cup'
    6 'Brighton beat Man Utd as fans boo Ten Hag'
    7 'Barcelona crush Betis to go top'
    8 'Sainz on Singapore pole as angry Verstappen struggles'
    9 "'We have a good vibe' - Ten Hag avoids Sancho questioning"
   10 'Newcastle beat Brentford to ease pressure on Howe'

To finish the test, let’s ask it to create a PDF file with these 10 BBC news headlines.

Python
> can you crate a pdf file with this 10 BBC news headlines?

  Sure, we can create a PDF file with these headlines. We will use the Python library reportlab for this purpose. Here is the plan:

   1 Install the necessary Python package (reportlab) if it is not already installed.
   2 Create a PDF file using reportlab and add the headlines to it.
   3 Save the PDF file.

  Let's start with the first step, which is to check if the necessary Python package is installed. If not, we will install it.


  REQUIRED_PACKAGES = ['reportlab']                                                                                                                                             
                                                                                                                                                                                
  for package in REQUIRED_PACKAGES:                                                                                                                                             
      try:                                                                                                                                                                      
          dist = pkg_resources.get_distribution(package)                                                                                                                        
          print('{} ({}) is installed'.format(dist.key, dist.version))                                                                                                          
      except pkg_resources.DistributionNotFound:                                                                                                                                
          print('{} is NOT installed'.format(package))                                                                                                                          


  reportlab (4.0.4) is installed


  Great! The 'reportlab' package is already installed.

  Now, let's move to the next step of the plan: Create a PDF file using reportlab and add the headlines to it. We will create a new PDF file, add each headline as a separate   
  paragraph, and then save the PDF file.



  from reportlab.lib.pagesizes import letter                                                                                                                                    
  from reportlab.pdfgen import canvas                                                                                                                                           
  from reportlab.lib.units import inch                                                                                                                                          
                                                                                                                                                                                
  # Create a new PDF file                                                                                                                                                       
  c = canvas.Canvas('BBC_headlines.pdf', pagesize=letter)                                                                                                                       
                                                                                                                                                                                
  # Set the font and size                                                                                                                                                       
  c.setFont('Helvetica', 12)                                                                                                                                                    
                                                                                                                                                                                
  # Add each headline as a separate paragraph                                                                                                                                   
  y = 750  # Start from the top of the page                                                                                                                                     
  for i, headline in enumerate(last_10_headlines, 1):                                                                                                                           
      c.drawString(50, y, f'{i}. {headline}')                                                                                                                                   
      y -= 25  # Move down for the next headline                                                                                                                                
                                                                                                                                                                                
  # Save the PDF file                                                                                                                                                           
  c.save()                                                                                                                                                                      

  The PDF file 'BBC_headlines.pdf' with the last 10 BBC news headlines has been created successfully. You can find it in the current working directory (D:\interpreter).        

It’s fantastic. Can you imagine how many things you can do with this?

They also have a Google Colab version with many usage examples.

I created a GitHub repository for this project. You can download or clone it from this link: https://github.com/LawrenceTeixeira/OpenInterpreter.git

The official Open Interpreter website is below:

https://openinterpreter.com/


How Does Open Interpreter Stand Out?

While numerous code interpreters are available, Open Interpreter sets itself apart in its approach to bridging human language and computer code. The emphasis on natural language processing combined with the power of local execution ensures that it’s both accessible to newcomers and powerful enough for seasoned developers.

Conclusion

Open Interpreter stands as a monumental advancement in local development platforms. Its superior features equip developers to operate productively and with adaptability, unfettered by the confines of online services. With Open Interpreter by your side, step into the next era of software development and discover unparalleled avenues in your local development odyssey. It is more than just a tool; it’s a step forward in the evolution of programming. Combining the powers of natural language processing, AI, and code interpretation paves the way for a more inclusive and streamlined coding environment. As the world embraces digital solutions, tools like Open Interpreter make technology accessible to everyone.

That’s it for today!

Sources:

The Open Interpreter Project

KillianLucas/open-interpreter: OpenAI’s Code Interpreter in your terminal, running locally (github.com)

https://beebom.com/how-run-open-interpreter-run-chatgpt-code-interpreter-locally/

https://openaimaster.com/what-is-open-interpreter/

Data Wrangler in Microsoft Fabric: A New Tool for Accelerating Data Preparation. Experience the Power Query Feel but with Python Code Output

In the modern digital era, the importance of streamlined data preparation cannot be emphasized enough. For data scientists and analysts, a large portion of time is dedicated to data cleansing and preparation, often termed ‘wrangling.’ Microsoft’s introduction of Data Wrangler in its Fabric suite seems like an answer to this age-old challenge. It promises the intuitiveness of Power Query with the flexibility of Python code output. Dive in to uncover the magic of this new tool.

Data preparation is a time-consuming and error-prone task. It often involves cleaning, transforming, and merging data from multiple sources. This can be a daunting task, even for experienced data scientists.

What is Data Wrangler?

Data Wrangler is a state-of-the-art tool in Microsoft’s Fabric suite, designed explicitly for data professionals. At its core, it aims to simplify the data preparation process by automating tedious tasks. Much like Power Query, it offers a user-friendly interface, but what sets it apart is that it generates Python code as output. As users interact with the GUI, Python code snippets are generated behind the scenes, making it easier to integrate the results into various data science workflows.

Advantages of Data Wrangler

  1. User-Friendly Interface: Offers an intuitive GUI for those not comfortable with coding.
  2. Python Code Output: Generates Python code in real-time, allowing flexibility and easy integration.
  3. Time-Saving: Reduces the time spent on data preparation dramatically.
  4. Replicability: Since Python code is generated, it ensures replicable data processing steps.
  5. Integration with Fabric Suite: Can be effortlessly integrated with other tools within the Microsoft Fabric suite.
  6. No-code to Low-code Transition: Ideal for those wanting to transition from a no-code environment to a more code-centric one.

How to use Data Wrangler?

  1. In the Power BI Service, click on Data Science.
  2. Select the Notebook button.
  3. After uploading the CSV file to the Lakehouse, insert the code below into the notebook.

Python
import pandas as pd

# Read the CSV file uploaded to the Lakehouse into a Pandas DataFrame
df = pd.read_csv("/lakehouse/default/Files/Top_1000_Companies_Dataset.csv")

Click the Launch Data Wrangler button and then select the DataFrame “df”.

On this screen, you can perform all the transformations you need.

In the end, code like the following will be generated.

Python
# Code generated by Data Wrangler for pandas DataFrame

def clean_data(df):
    # Drop columns: 'company_name', 'url' and 6 other columns
    df = df.drop(columns=['company_name', 'url', 'city', 'state', 'country', 'employees', 'linkedin_url', 'founded'])
    # Drop columns: 'GrowjoRanking', 'Previous Ranking' and 10 other columns
    df = df.drop(columns=['GrowjoRanking', 'Previous Ranking', 'job_openings', 'keywords', 'LeadInvestors', 'Accelerator', 'valuation', 'btype', 'total_funding', 'product_url', 'growth_percentage', 'contact_info'])
    # Drop column: 'indeed_url'
    df = df.drop(columns=['indeed_url'])
    # Performed 1 aggregation grouped on column: 'Industry'
    df = df.groupby(['Industry']).agg(estimated_revenues_sum=('estimated_revenues', 'sum')).reset_index()
    # Sort by column: 'estimated_revenues_sum' (descending)
    df = df.sort_values(['estimated_revenues_sum'], ascending=[False])
    return df

df_clean = clean_data(df.copy())
df_clean.head()

After that, you can add this transformation to a pipeline, or schedule it to run automatically.

Data Wrangler Extension for Visual Studio Code

Data Wrangler is a code-centric data cleaning tool integrated into VS Code and Jupyter Notebooks. Data Wrangler aims to increase the productivity of data scientists doing data cleaning by providing a rich user interface that automatically generates Pandas code and shows insightful column statistics and visualizations.

This section covers how to:

  • Install and set up Data Wrangler
  • Launch Data Wrangler from a notebook
  • Use Data Wrangler to explore your data
  • Perform operations on your data
  • Edit and export code for data wrangling to a notebook
  • Troubleshoot and provide feedback

Setting up your environment

  1. If you have not already done so, install Python.
    IMPORTANT: Data Wrangler only supports Python version 3.8 or higher.
  2. Install Visual Studio Code.
  3. Install the Data Wrangler extension for VS Code from the Visual Studio Marketplace. For additional details on installing extensions, see Extension Marketplace. The extension is named Data Wrangler and is published by Microsoft.

When you launch Data Wrangler for the first time, it will ask you which Python kernel you would like to connect to. It will also check your machine and environment to see if any required Python packages are installed (e.g., Pandas).

Here is a list of the required versions for Python and Python packages, along with whether they are automatically installed by Data Wrangler:

Name | Minimum required version | Automatically installed
Python | 3.8 | No
pandas | 0.25.2 | Yes
regex* | 2020.11.13 | Yes

* We use the open-source regex package to be able to use Unicode properties (for example, /\p{Lowercase_Letter}/), which aren’t supported by Python’s built-in re module. Unicode properties make it easier and cleaner to support foreign characters in regular expressions.
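As a quick illustration of the difference (this is a minimal sketch, not Data Wrangler’s own code), the snippet below matches accented lowercase letters with the regex package, something the built-in re module rejects:

Python
import re
import regex

text = "Água é vida"

# The third-party regex package understands Unicode property classes such as
# \p{Lowercase_Letter}, so accented lowercase letters like 'é' are matched.
print(regex.findall(r"\p{Lowercase_Letter}+", text))  # ['gua', 'é', 'vida']

# The built-in re module raises an error for \p{...} escapes.
try:
    re.findall(r"\p{Lowercase_Letter}+", text)
except re.error as exc:
    print("re does not support \\p{...}:", exc)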

If they are not found in your environment, Data Wrangler will attempt to install them for you via pip. If Data Wrangler cannot install the dependencies, the easiest workaround is to run the pip install manually and then relaunch Data Wrangler. These dependencies are required so that Data Wrangler can generate Python and Pandas code.

Connecting to a Python kernel

There are currently two ways to connect to a Python kernel, described below.

1. Connect using a local Python interpreter

If this option is selected, the kernel connection is created using the Jupyter and Python extensions. We recommend this option for a simple setup and a quick way to start with Data Wrangler.

2. Connect using Jupyter URL and token

A kernel connection is created using JupyterLab APIs if this option is selected. Note that this option has performance benefits since it bypasses some initialization and kernel discovery processes. However, it will also require separate Jupyter Notebook server user management. We recommend this option generally in two cases: 1) if there are blocking issues in the first method and 2) for power users who would like to reduce the cold-start time of Data Wrangler.

To set up a Jupyter Notebook server and use it with this option, follow the steps below:

  1. Install Jupyter. We recommend installing Anaconda, which comes with Jupyter installed. Alternatively, follow the official instructions to install it.
  2. In the appropriate environment (e.g., in an Anaconda prompt if Anaconda is used), launch the server with the following command (replace the jupyter token with your secure token):
    jupyter notebook --no-browser --NotebookApp.token='<your-jupyter-token>'
  3. In Data Wrangler, connect using the address of the spawned server. E.g., http://localhost:8888, and pass in the token used in the previous step. Once configured, this information is cached locally and can automatically be reused for future connections.

Launching Data Wrangler

Once Data Wrangler has been successfully installed, there are two ways to launch it in VS Code.

Launching Data Wrangler from a Jupyter Notebook

If you are in a Jupyter Notebook working with Pandas data frames, you’ll now see a “Launch Data Wrangler” button appear after running specific operations on your data frame, such as df.head(). Clicking the button will open a new tab in VS Code with the Data Wrangler interface in a sandboxed environment.

Important note:
We currently only accept the following formats for launching:

  • df
  • df.head()
  • df.tail()

Where df is the name of the data frame variable. The code above should appear at the end of a cell without any comments or other code after it.
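For instance, a minimal notebook cell that would surface the button could look like this (titanic.csv is just a placeholder file name, not part of the Data Wrangler documentation):

Python
import pandas as pd

# Hypothetical example; replace titanic.csv with your own dataset.
df = pd.read_csv("titanic.csv")

# The cell must end with df, df.head(), or df.tail(), with no code or comments
# after it, so that the "Launch Data Wrangler" button appears below the output.
df.head()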


Launching Data Wrangler directly from a CSV file

You can also launch Data Wrangler directly from a local CSV file. To do so, open any VS Code folder with the CSV dataset you’d like to explore. In the File Explorer panel, right-click the CSV dataset and click “Open in Data Wrangler.”


Using Data Wrangler


The Data Wrangler interface is divided into 6 components, described below.

The Quick Insights header lets you quickly see valuable information about each column. Depending on the column’s datatype, Quick Insights will show the distribution of the data, the frequency of data points, and missing and unique values.

The Data Grid gives you a scrollable pane to view your entire dataset. Additionally, when selecting an operation to perform, a preview will be illustrated in the data grid, highlighting the modified columns.

The Operations Panel is where you can search through Data Wrangler’s built-in data operations. The operations are organized by their top-level category.

The Summary Panel shows detailed summary statistics for your dataset or a specific column if one is selected. Depending on the data type, it will show information such as min, max values, datatype of the column, skew, and more.

The Operation History Panel shows a human-readable list of all the operations previously applied in the current Data Wrangling session. It enables users to undo specific operations or edit the most recent operation. Selecting a step will highlight the data grid changes and show the generated code associated with that operation.

The Code Preview section will show the Python and Pandas code that Data Wrangler has generated when an operation is selected. It will remain blank when no operation is selected. The code can even be edited by the user, and the data grid will highlight the effect on the data.
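To make the kind of information shown by the Quick Insights header and the Summary Panel concrete, here is a rough sketch of comparable statistics computed by hand with Pandas on a tiny stand-in DataFrame; this is only an illustration, not Data Wrangler’s internal code:

Python
import pandas as pd

# Stand-in DataFrame; in practice this is whatever dataset you opened in Data Wrangler.
df = pd.DataFrame({"Age": [0.42, 18.0, 22.0, 35.5, 80.0, None]})

# Roughly what Quick Insights surfaces for a column
print(df["Age"].isna().sum())   # missing values
print(df["Age"].nunique())      # unique values

# Roughly what the Summary Panel shows for a numeric column
print(df["Age"].dtype)          # datatype of the column
print(df["Age"].min(), df["Age"].max())
print(df["Age"].mean(), df["Age"].median())
print(df["Age"].skew())         # skew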

Example: Filtering a column

Let’s go through a simple example using Data Wrangler with the Titanic dataset to filter adult passengers on the ship.

We’ll start by looking at the quick insights of the Age column, and we’ll notice the distribution of the ages and that the minimum age is 0.42. For more information, we can glance at the Summary panel to see that the datatype is a float, along with additional statistics such as the passengers’ mean and median age.


To filter for only adult passengers, we can go to the Operation Panel and search for the keyword “Filter” to find the Filter operation. (You can also expand the “Sort and filter” category to find it.)


Once we select an operation, we are brought into the Operation Preview state, where parameters can be modified to see how they affect the underlying dataset before applying the operation. In this example, we want to filter the dataset only to include adults, so we’ll want to filter the Age column to only include values greater than or equal to 18.


Once the parameters are entered in the operation panel, we can see a preview of what will happen to the data. We’ll notice that the minimum value in age is now 18 in the Quick Insights, along with a visual preview of the rows that are being removed, highlighted in red. Finally, we’ll also notice the Code Preview section automatically shows the code that Data Wrangler produced to execute this Filter operation. We can edit this code by changing the filtered age to 21, and the data grid will automatically update accordingly.

After confirming that the operation has the intended effect, we can click Apply.
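As a rough idea of what the Code Preview might contain for this step, below is a hand-written Pandas equivalent of the Age filter; the code Data Wrangler actually generates may differ in structure and naming:

Python
import pandas as pd

# Stand-in data; in the walkthrough above, df is the Titanic dataset.
df = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [0.42, 21.0, 45.0]})

def clean_data(df):
    # Filter rows on column 'Age': keep values greater than or equal to 18
    df = df[df['Age'] >= 18]
    return df

df_clean = clean_data(df.copy())
print(df_clean)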

Editing and exporting code

Each step of the generated code can be modified. Changes to the data will be highlighted in the grid view as you make changes.

Once you’re done with your data cleaning steps in Data Wrangler, there are 3 ways to export your cleaned dataset from Data Wrangler.

  1. Export code back to Notebook and exit: This creates a new cell in your Jupyter Notebook with all the data cleaning code you generated packaged into a clean Python function.
  2. Export data as CSV: This saves the cleaned dataset as a new CSV file onto your machine.
  3. Copy code to clipboard: This copies all the code generated by Data Wrangler for the data cleaning operations.

Note: If you launched Data Wrangler directly from a CSV, the first export option will be to export the code into a new Jupyter Notebook.

Data Wrangler operations

These are the Data Wrangler operations currently supported in the initial launch of Data Wrangler (with many more to be added soon).

Operation | Description
Sort values | Sort column(s) in ascending or descending order
Filter | Filter rows based on one or more conditions
Calculate text length | Create a new column with values equal to the length of each string value in a text column
One-hot encode | Split categorical data into a new column for each category
Multi-label binarizer | Split categorical data into a new column for each category using a delimiter
Create column from formula | Create a column using a custom Python formula
Change column type | Change the data type of a column
Drop column | Delete one or more columns
Select column | Choose one or more columns to keep and delete the rest
Rename column | Rename one or more columns
Drop missing values | Remove rows with missing values
Drop duplicate rows | Drop all rows that have duplicate values in one or more columns
Fill missing values | Replace cells with missing values with a new value
Find and replace | Replace cells with an exact matching value
Group by column and aggregate | Group by columns and aggregate results
Strip whitespace | Remove whitespace from the beginning and end of the text
Split text | Split a column into several columns based on a user-defined delimiter
Convert text to capital case | Capitalize the first character of a string, with the option to apply to all words
Convert text to lowercase | Convert text to lowercase
Convert text to uppercase | Convert text to UPPERCASE
String transform by example | Automatically perform string transformations when a pattern is detected from the examples you provide
DateTime formatting by example | Automatically perform DateTime formatting when a pattern is detected from the examples you provide
New column by example | Automatically create a column when a pattern is detected from the examples you provide
Scale min/max values | Scale a numerical column between a minimum and maximum value
Custom operation | Automatically create a new column based on examples and the derivation of existing column(s)
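To give a feel for the code these operations translate to, here are hand-written Pandas equivalents of a few of them. This is illustrative only; the code Data Wrangler emits may differ:

Python
import pandas as pd

df = pd.DataFrame({
    "city": ["Lisbon", "Porto", None, "Lisbon"],
    "revenue": [10.0, 4.0, 7.0, None],
})

# Sort values: sort a column in descending order
df = df.sort_values("revenue", ascending=False)

# Fill missing values: replace missing cells with a new value
df["revenue"] = df["revenue"].fillna(0)

# One-hot encode: split categorical data into a new column for each category
df = pd.get_dummies(df, columns=["city"])

# Scale min/max values: scale a numerical column between 0 and 1
rev = df["revenue"]
df["revenue"] = (rev - rev.min()) / (rev.max() - rev.min())

print(df)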

Limitations

Data Wrangler currently supports only Pandas DataFrames. Support for Spark DataFrames is in progress.
Data Wrangler’s display works better on large monitors, although different interface portions can be minimized or hidden to accommodate smaller screens.

Conclusion

Data Wrangler in Microsoft Fabric is undeniably a game-changer in data preparation. It combines the best of both worlds by offering the simplicity of Power Query with the robustness and flexibility of Python. As data continues to grow in importance, tools like Data Wrangler that simplify and expedite the data preparation process will be indispensable for organizations aiming to stay ahead.

That’s it for today!

Sources:

https://medium.com/towards-data-engineering/data-wrangler-in-fabric-simplifying-data-prep-with-no-code-ab4fe7429b49

https://radacad.com/fabric-data-wrangler-a-tool-for-data-scientist

https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler

https://marketplace.visualstudio.com/items?itemName=ms-toolsai.datawrangler

https://github.com/microsoft/vscode-data-wrangler