Building a Secure Sandbox for LangChain's create_pandas_dataframe_agent Using Docker


Qing Ye

~3 min read · December 22, 2024

Problem

The create_pandas_dataframe_agent utility in LangChain builds a powerful agent for interacting with dataframes. However, it poses a significant security risk when used as-is. Its Python tool executes whatever code the model generates, which means a malicious input could run harmful operations directly on your machine.

Without a built-in mechanism to isolate code execution, securing your environment becomes a challenge. While LangChain allows tool customization, creating a robust sandbox isn't straightforward.
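To make the risk concrete, here is a minimal sketch that uses plain exec as a stand-in for an unsandboxed REPL tool. The run_untrusted helper and the generated string are hypothetical, chosen only for illustration; the point is that the executed code can touch the host filesystem with the full privileges of the calling process.

```python
import tempfile
from pathlib import Path


def run_untrusted(code: str) -> None:
    """Hypothetical stand-in for an unsandboxed REPL tool: runs the string as-is."""
    exec(code, {})  # no isolation: the code runs with this process's privileges


with tempfile.TemporaryDirectory() as workdir:
    marker = Path(workdir) / "written_by_generated_code.txt"
    # Pretend this string came back from the model
    generated = f"open({str(marker)!r}, 'w').write('hello from generated code')"
    run_untrusted(generated)
    file_was_created = marker.exists()

print("Generated code wrote to the host filesystem:", file_was_created)
```

The write here lands in a throwaway directory, but nothing in exec restricts it to that location.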

Solution

By leveraging Docker, we can build a sandbox that isolates code execution from the host machine. Here's how we can implement and use a DockerSandbox class that inherits from LangChain's PythonAstREPLTool. This approach ensures code is executed in a controlled environment with clear boundaries.
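As a rough picture of what those boundaries can look like, the container limits can be gathered into one place as keyword arguments for the Docker SDK's containers.run(). The helper name sandbox_run_options and the read-only mount of a dedicated directory are assumptions for illustration; the DockerSandbox class in this article mounts /tmp read-write instead.

```python
from typing import Any, Dict


def sandbox_run_options(host_data_dir: str) -> Dict[str, Any]:
    """Hypothetical helper: keyword arguments that bound a single container run."""
    return {
        # Expose only one dedicated directory, read-only (an assumption,
        # stricter than the read-write /tmp mount used in this article)
        "volumes": {host_data_dir: {"bind": "/data", "mode": "ro"}},
        "mem_limit": "100m",     # cap memory
        "cpu_period": 100000,    # together with cpu_quota: at most half of one CPU
        "cpu_quota": 50000,
        "network_mode": "none",  # no network access from inside the container
        "remove": True,          # delete the container once it exits
    }


options = sandbox_run_options("/tmp/sandbox_demo")
print(options)
```

These would be passed as client.containers.run(image, command=..., **options). With a read-only mount the container cannot write result files back, so output would have to come from stdout.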

Steps to Implement the Sandbox

1. Ensure Docker Is Installed

Before proceeding, make sure Docker is installed and running on your machine. You can verify the installation by running:

docker --version
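The version check only confirms that the CLI is installed. To also confirm that the daemon is reachable, one option is a small check from Python. This is a sketch using only the standard library; the function name docker_is_ready is hypothetical.

```python
import shutil
import subprocess


def docker_is_ready(timeout: float = 10.0) -> bool:
    """Return True only if the docker CLI exists and the daemon answers."""
    if shutil.which("docker") is None:
        return False  # CLI not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],  # exits non-zero when the daemon is unreachable
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("Docker ready:", docker_is_ready())
```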

2. Create the DockerSandbox Class

We build a DockerSandbox class that inherits from LangChain's PythonAstREPLTool. This provides:

Seamless Integration: Inheriting from PythonAstREPLTool ensures compatibility with LangChain's tool ecosystem.
Code Reuse: Keeps the parent's tool interface and its sanitize_input option, so only _run needs to be overridden.
Security: Docker isolates the code execution, protecting your machine from potential risks.

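import os
import uuid

import docker
import pandas as pd
from langchain_experimental.tools.python.tool import PythonAstREPLTool, sanitize_input
from pydantic import Field, PrivateAttr


class DockerSandbox(PythonAstREPLTool):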
    """An enhanced DockerSandbox for running Python code in a Docker container."""

    name: str = "sandbox_execution"
    description: str = "Executes Python code securely in Docker container"
    df: pd.DataFrame = Field(default_factory=pd.DataFrame)
    _client: docker.DockerClient = PrivateAttr(default=None)  # Private attribute

    def __init__(self, dataframe: pd.DataFrame):
        super().__init__()
        self.df = dataframe
        self._client = docker.from_env()

    def _run(self, query: str, run_manager=None) -> str:
        unique_id = uuid.uuid4().hex
        temp_data_path = f"/tmp/data_{unique_id}.csv"

        # Clean up old files
        for file in ["/tmp/result.csv", "/tmp/result.txt"]:
            if os.path.exists(file):
                os.remove(file)

        # Save the dataframe
        self.df.to_csv(temp_data_path, index=False)

        # Sanitize input if the flag is set
        if self.sanitize_input:
            query = sanitize_input(query)
        indented_query = "\n".join([f"    {line}" for line in query.splitlines()])

        # Add necessary imports and sandbox code
        full_code = f"""
import pandas as pd
df = pd.read_csv('/data/data_{unique_id}.csv')

try:
{indented_query}
except Exception as e:
    print(f"Error during execution: {{str(e)}}")
"""
        # Debugging: Print the final script
        print("Executing the following Python script in Docker:")
        print(full_code)

        try:
            container_output = self._client.containers.run(
                "jupyter/scipy-notebook:latest",
                command=["python", "-c", full_code],
                # Mounts the host's whole /tmp read-write; a dedicated directory would expose less
                volumes={"/tmp": {"bind": "/data", "mode": "rw"}},
                mem_limit="100m",
                cpu_period=100000,
                cpu_quota=50000,
                network_mode="none",
                remove=True,
                stdout=True,
                stderr=True,
            )

            # Read results
            if os.path.exists("/tmp/result.txt"):
                with open("/tmp/result.txt") as f:
                    result = f.read()
                return result

            decoded_output = container_output.decode("utf-8")
            print("Container output:")
            print(decoded_output)
            return decoded_output

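        except Exception as e:
            return f"Error: {str(e)}"
        finally:
            # Remove the per-run CSV so copies of the dataframe do not pile up in /tmp
            if os.path.exists(temp_data_path):
                os.remove(temp_data_path)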
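One behavior to be aware of: python -c does not echo the value of a trailing expression the way a REPL does, so a query such as df.head() produces no output unless it calls print. A possible refinement, sketched below with the standard ast module, wraps a trailing bare expression in print() before building the script. The helper names echo_last_expression and build_sandbox_script are hypothetical and not part of the class above.

```python
import ast
import textwrap


def echo_last_expression(code: str) -> str:
    """If the final statement is a bare expression, wrap it in print()."""
    tree = ast.parse(code)
    if tree.body and isinstance(tree.body[-1], ast.Expr):
        value = tree.body[-1].value
        already_print = (
            isinstance(value, ast.Call)
            and isinstance(value.func, ast.Name)
            and value.func.id == "print"
        )
        if not already_print:
            call = ast.Call(
                func=ast.Name(id="print", ctx=ast.Load()),
                args=[value],
                keywords=[],
            )
            tree.body[-1] = ast.Expr(value=call)
            ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


def build_sandbox_script(query: str, data_file: str) -> str:
    """Assemble the script that would run inside the container."""
    # Fall back to `pass` so an empty query still yields a valid try block
    body = textwrap.indent(echo_last_expression(query) or "pass", "    ")
    return (
        "import pandas as pd\n"
        f"df = pd.read_csv({data_file!r})\n"
        "\n"
        "try:\n"
        f"{body}\n"
        "except Exception as e:\n"
        "    print(f'Error during execution: {e}')\n"
    )


script = build_sandbox_script("x = 1 + 1\nx * 21", "/data/example.csv")
compile(script, "<sandbox>", "exec")  # raises SyntaxError if the script is malformed
print(script)
```

Inside _run, the result of build_sandbox_script would take the place of the full_code string.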

3. Pass the Sandbox to the Agent

After creating the DockerSandbox, you can pass it to the agent to replace the default tool. Here's how:

sandbox = DockerSandbox(df)
agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    verbose=True,
    allow_dangerous_code=True,
)
agent.tools = [sandbox]

Testing the Solution

I tested this solution using Google's gemini-2.0-flash-exp model and the Titanic dataset. The input asked for the most statistically significant factors affecting survival:

agent.invoke(
    "What are the most (statistically) significant factors that causally affect the survival rate of passengers on the Titanic?"
)

Output:

"The most statistically significant factors affecting survival are 'Pclass', 'Age', and 'SibSp'."

Final Thoughts

This sandbox reduces the risk of running model-generated Python code on the host while still plugging into LangChain's tool interface. Container isolation is not absolute, but combined with memory and CPU limits and a disabled network, it makes for a much safer setup for interactive data analysis than the default tool.

You can test this solution by cloning my notebook here.

Happy coding! 🚀

#langchain-agents #langchain-tools #llm-agent #data-analytics
