Problem
The create_pandas_dataframe_agent utility in LangChain is a powerful agent for interacting with dataframes. However, it poses a significant security risk when used as-is. The tool can execute arbitrary Python code, which means a malicious input could run harmful operations directly on your machine.
Without a built-in mechanism to isolate code execution, securing your environment becomes a challenge. While LangChain allows tool customization, creating a robust sandbox isn't straightforward.
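To make the risk concrete, here is a minimal sketch of what "arbitrary code execution" means in practice. It is not LangChain's implementation: plain `exec` stands in for an unsandboxed Python REPL tool, and the `untrusted_code` string and `target_dir` variable are illustrative assumptions. The write is confined to a temporary directory so the demo itself is harmless.

```python
import os
import tempfile

# Stand-in for model-generated code; a REPL tool receives its input as a string like this.
untrusted_code = """
import os
with open(os.path.join(target_dir, "written_by_untrusted_code.txt"), "w") as f:
    f.write("this ran with the host process's permissions")
"""

with tempfile.TemporaryDirectory() as target_dir:
    # exec runs the string inside the current process, so it can touch
    # anything the process itself can touch (files, environment, network).
    exec(untrusted_code, {"target_dir": target_dir})
    created = os.listdir(target_dir)

print(created)
```

The same string could just as easily have deleted files or read credentials, which is the gap the sandbox below is meant to close.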
Solution
By leveraging Docker, we can build a sandbox that isolates code execution from the host machine. Here's how we can implement and use a DockerSandbox class that inherits from LangChain's PythonAstREPLTool. With this approach, generated code runs inside a container with explicit resource and network limits instead of directly on your machine.
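As a quick reading aid, the container limits that appear in the `containers.run(...)` call of the class further down can be laid out on their own. The dictionary name `run_limits` is only an illustration; the values are copied from that call.

```python
# Values copied from the containers.run(...) call in the DockerSandbox class.
run_limits = {
    "mem_limit": "100m",       # cap the container at 100 MB of memory
    "cpu_period": 100_000,     # scheduling period in microseconds
    "cpu_quota": 50_000,       # CPU time allowed per period, in microseconds
    "network_mode": "none",    # no network access from inside the container
}

# Quota divided by period gives the fraction of one CPU core the container may use.
cpu_share = run_limits["cpu_quota"] / run_limits["cpu_period"]
print(f"CPU cap: {cpu_share:.0%} of one core")
```

With these numbers the container is limited to half of one core.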
Steps to Implement the Sandbox
1. Ensure Docker Is Installed
Before proceeding, make sure Docker is installed and running on your machine. You can verify this by running:
docker --version
2. Create the DockerSandbox Class
We build a DockerSandbox class that inherits from LangChain's PythonAstREPLTool. This provides:
- Seamless Integration: Inheriting PythonAstREPLTool ensures compatibility with LangChain's tool ecosystem.
- Code Reuse: Leverages built-in functionality for parsing and evaluating Python code.
- Security: Docker isolates the code execution, protecting your machine from potential risks.
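Before the full class, the step that turns a user query into a standalone script can be sketched in isolation. This is a simplified illustration rather than the class's exact code: the helper name `build_sandbox_script` and the example path are assumptions, and `textwrap.indent` is swapped in for the manual line-by-line indentation used in the class.

```python
import textwrap


def build_sandbox_script(query: str, data_path: str) -> str:
    """Wrap user code in a script that loads the dataframe and reports errors."""
    # Indent the query so it sits inside the try block; fall back to `pass`
    # so that an empty query still produces valid Python.
    body = textwrap.indent(query, "    ") or "    pass"
    return (
        "import pandas as pd\n"
        f"df = pd.read_csv({data_path!r})\n"
        "try:\n"
        f"{body}\n"
        "except Exception as e:\n"
        '    print(f"Error during execution: {e}")\n'
    )


script = build_sandbox_script(
    "print(df.shape)\nprint(df.columns.tolist())",
    "/data/data_example.csv",  # hypothetical path inside the container
)
compile(script, "<sandbox>", "exec")  # raises SyntaxError if the wrapping is malformed
print(script)
```

Because the script is run with `python -c`, only what the query prints comes back as output, so queries should end in a `print(...)` call.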
import os
import uuid

import docker
import pandas as pd
# Import paths assume a recent langchain-experimental release built on Pydantic v2.
from langchain_experimental.tools.python.tool import PythonAstREPLTool, sanitize_input
from pydantic import Field, PrivateAttr


class DockerSandbox(PythonAstREPLTool):
    """An enhanced DockerSandbox for running Python code in a Docker container."""

    name: str = "sandbox_execution"
    description: str = "Executes Python code securely in Docker container"
    df: pd.DataFrame = Field(default_factory=pd.DataFrame)
    _client: docker.DockerClient = PrivateAttr(default=None)  # Private attribute

    def __init__(self, dataframe: pd.DataFrame):
        super().__init__()
        self.df = dataframe
        self._client = docker.from_env()

    def _run(self, query: str, run_manager=None) -> str:
        unique_id = uuid.uuid4().hex
        temp_data_path = f"/tmp/data_{unique_id}.csv"

        # Clean up old files
        for file in ["/tmp/result.csv", "/tmp/result.txt"]:
            if os.path.exists(file):
                os.remove(file)

        # Save the dataframe
        self.df.to_csv(temp_data_path, index=False)

        # Sanitize input if the flag is set
        if self.sanitize_input:
            query = sanitize_input(query)

        # Indent the query so it sits inside the try block of the script below
        indented_query = "\n".join([f"    {line}" for line in query.splitlines()])

        # Add necessary imports and sandbox code
        full_code = f"""
import pandas as pd
df = pd.read_csv('/data/data_{unique_id}.csv')
try:
{indented_query}
except Exception as e:
    print(f"Error during execution: {{str(e)}}")
"""
        # Debugging: Print the final script
        print("Executing the following Python script in Docker:")
        print(full_code)

        try:
            container_output = self._client.containers.run(
                "jupyter/scipy-notebook:latest",
                command=["python", "-c", full_code],
                volumes={"/tmp": {"bind": "/data", "mode": "rw"}},
                mem_limit="100m",
                cpu_period=100000,
                cpu_quota=50000,
                network_mode="none",
                remove=True,
                stdout=True,
                stderr=True,
            )

            # Read results
            if os.path.exists("/tmp/result.txt"):
                with open("/tmp/result.txt") as f:
                    result = f.read()
                return result

            decoded_output = container_output.decode("utf-8")
            print("Container output:")
            print(decoded_output)
            return decoded_output
        except Exception as e:
            return f"Error: {str(e)}"
        finally:
            # Remove this run's copy of the dataframe from the host
            if os.path.exists(temp_data_path):
                os.remove(temp_data_path)
3. Pass the Sandbox to the Agent
After creating the DockerSandbox, you can pass it to the agent to replace the default tool. Here's how:
sandbox = DockerSandbox(df)
agent = create_pandas_dataframe_agent(
    llm=llm,
    df=df,
    verbose=True,
    allow_dangerous_code=True,
)
agent.tools = [sandbox]
Testing the Solution
I tested this solution using Google's gemini-2.0-flash-exp model and the Titanic dataset. The input asked for the most statistically significant factors affecting survival:
agent.invoke(
    "What are the most (statistically) significant factors that causally affect the survival rate of passengers on the Titanic?"
)
Output:
"The most statistically significant factors affecting survival are 'Pclass', 'Age', and 'SibSp'."
Final Thoughts
This sandbox mitigates the security risks of running arbitrary Python code while still plugging into LangChain's tool interface. Generated code runs in a container with no network access, a 100 MB memory limit, and half a CPU core rather than directly on the host, which makes the approach a practical option for interactive data analysis.
You can test this solution by cloning my notebook here.
Happy coding! 🚀