AI Data Analyst Agent

A multi-agent AI system that automates ETL, data cleaning, and analysis. It always keeps a human in the loop, has self-correcting capabilities, and lets users query and clean their data using natural language.

The user ingests raw datasets. The system then profiles their schemas, proposes deterministic cleaning strategies, executes the transformations in a sandbox once a human approves them, and provides conversational SQL and visualization interfaces.
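
The profile, propose, approve, and execute flow can be sketched as follows. This is a minimal illustration using only the standard library, not the project's implementation: the `CleaningStep` type, the two actions, and the function names are all hypothetical, and plain lists of dicts stand in for Pandas DataFrames.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class CleaningStep:
    """One proposed transformation (hypothetical shape, for illustration)."""
    column: str
    action: str          # "strip_whitespace" or "fill_missing"
    value: Any = None    # fill value, used only by "fill_missing"

def profile(rows):
    """Count missing values and record the observed types for each column."""
    report = {}
    for row in rows:
        for col, val in row.items():
            info = report.setdefault(col, {"missing": 0, "types": set()})
            if val is None:
                info["missing"] += 1
            else:
                info["types"].add(type(val).__name__)
    return report

def propose(report):
    """Derive steps from the profile with fixed rules, so output is deterministic."""
    steps = []
    for col, info in report.items():
        if info["types"] == {"str"}:
            steps.append(CleaningStep(col, "strip_whitespace"))
        if info["missing"] and info["types"] and info["types"] <= {"int", "float"}:
            steps.append(CleaningStep(col, "fill_missing", 0))
    return steps

def execute(rows, steps, approved):
    """Apply only the approved steps and return new rows; the input is not mutated."""
    out = [dict(r) for r in rows]
    for step in steps:
        if step not in approved:
            continue  # anything the human did not approve is skipped
        for row in out:
            val = row.get(step.column)
            if step.action == "strip_whitespace" and isinstance(val, str):
                row[step.column] = val.strip()
            elif step.action == "fill_missing" and val is None:
                row[step.column] = step.value
    return out
```

Keeping the proposal as data rather than as free-form code is what makes an approval step practical: the reviewer sees a finite list of steps and can accept or reject each one before anything runs.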

System Architecture

The core workflow is orchestrated with LangGraph as a supergraph composed of independently compiled subgraphs.

Ingestion Agent: Optimized dataset loading with vectorized Pandas processing.
Merging Agent: Intelligent schema similarity analysis and automated join strategy generation.
Cleaning Agent: Profiler, Strategist, and Engineer nodes generating structured Pydantic cleaning plans.
Chat Agent: Semantic routing to DuckDB SQL expert or Python visualization node.
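
As an illustration of the Merging Agent's schema similarity step, the sketch below scores column-name pairs and suggests candidate join keys. It is an assumption about how such a step could work, not the project's algorithm: the function names and the 0.8 threshold are hypothetical, and it compares names only, whereas a real implementation would likely also compare dtypes and value overlap.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and drop non-alphanumerics so 'Customer_ID' matches 'customerid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def column_similarity(a, b):
    """Similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def suggest_join_keys(left_cols, right_cols, threshold=0.8):
    """Pair each left column with its best right match, keeping strong matches only."""
    if not right_cols:
        return []
    pairs = []
    for left in left_cols:
        best = max(right_cols, key=lambda right: column_similarity(left, right))
        score = column_similarity(left, best)
        if score >= threshold:
            pairs.append((left, best, round(score, 2)))
    # Strongest candidates first
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

For example, `suggest_join_keys(["Customer_ID", "order_date"], ["customerid", "amount"])` pairs `Customer_ID` with `customerid` and drops `order_date`, which has no close counterpart.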

Execution Safety

All LLM-generated code is executed inside an isolated Docker sandbox. Strict timeouts and file system isolation limit the risk to the host system.
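
A sandboxed run of this kind might look like the sketch below. The flags are standard `docker run` options, but the image name, mount path, resource limits, and function names are assumptions chosen for illustration and are not taken from the project.

```python
import subprocess
import uuid
from pathlib import Path

def build_docker_cmd(script, name, image="python:3.12-slim"):
    """Build a `docker run` command that executes one script with tight isolation."""
    return [
        "docker", "run", "--rm", "--name", name,
        "--network", "none",     # no network access
        "--read-only",           # root file system is read-only
        "--tmpfs", "/tmp",       # scratch space that disappears with the container
        "--memory", "512m",      # assumed memory cap
        "--pids-limit", "64",    # assumed process cap
        "-v", f"{Path(script).resolve()}:/sandbox/script.py:ro",
        image, "python", "/sandbox/script.py",
    ]

def run_sandboxed(script, timeout_s=30.0):
    """Run the script in a container and enforce a wall-clock timeout."""
    name = f"sandbox-{uuid.uuid4().hex[:12]}"
    cmd = build_docker_cmd(script, name)
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The timeout only stops the docker CLI process, so the container
        # is killed explicitly by name before the error is re-raised.
        subprocess.run(["docker", "kill", name], capture_output=True)
        raise
```

The returned `CompletedProcess` exposes `returncode`, `stdout`, and `stderr`, which a self-correcting loop could feed back to the model when a generated script fails.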

State Management

The system supports human-in-the-loop workflows through PostgreSQL-based checkpointing with AsyncPostgresSaver. Nested graph states are recursively flattened so that a run can be resumed across stateless HTTP calls.
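
The exact flattening scheme is not described here. One plausible reading, sketched below with a hypothetical `flatten_state` helper, is to collapse nested subgraph state into a single level keyed by dotted paths, so that an HTTP handler can return or inspect it without walking the graph hierarchy.

```python
def flatten_state(state, parent="", sep="."):
    """Recursively flatten nested dicts into one dict keyed by dotted paths."""
    flat = {}
    for key, value in state.items():
        path = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict) and value:
            # Descend into non-empty nested state (e.g. a subgraph's state)
            flat.update(flatten_state(value, path, sep))
        else:
            # Leaves, including empty dicts, are stored as-is
            flat[path] = value
    return flat

nested = {
    "thread_id": "t1",
    "cleaning": {"step": 2, "plan": {"approved": False}},
}
print(flatten_state(nested))
# {'thread_id': 't1', 'cleaning.step': 2, 'cleaning.plan.approved': False}
```

Note that dotted paths are ambiguous if a key already contains the separator, so a real implementation would need to pick a separator that cannot appear in state keys.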

Performance Metrics

Graph Execution p50 Latency: 2.478s
LLM Generation p50 Latency: 1.765s
Intent Routing p50 Latency: 0.473s
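
How these figures were collected is not stated. For reference, p50 is the median of the observed latencies, and a per-stage measurement could be sketched as below. The `timed` decorator, the `latencies` store, and the stage names are hypothetical.

```python
import time
from collections import defaultdict
from functools import wraps
from statistics import median

# Stage name -> list of observed durations in seconds
latencies = defaultdict(list)

def timed(stage):
    """Decorator that records the wall-clock duration of each call under `stage`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Recorded even when the call raises
                latencies[stage].append(time.perf_counter() - start)
        return wrapper
    return decorator

def p50(stage):
    """Median latency for a stage; raises StatisticsError if nothing was recorded."""
    return median(latencies[stage])

@timed("intent_routing")
def route(question):
    return "sql" if "how many" in question.lower() else "viz"
```

Because medians are computed per stage, the stage-level p50 values do not need to add up to the end-to-end p50.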

Frontend Interface

Tech Stack

FastAPI, LangChain, LangGraph, Groq Llama 3 models, Pandas, DuckDB, Docker, PostgreSQL, Redis, Langfuse