PandasAI: Doing Data Analytics with AI, locally, with Local LLMs
So I was poking around some tech guides the other day—you know how it is, one tab leads to another—and I stumbled upon a really interesting piece by the Digital Archer group (old friends of mine) on using local LLMs for CSV analysis. It got me thinking about how much the “Text-to-SQL” world has dominated our conversations lately, while the “Text-to-Python” side of things is where a lot of the heavy lifting actually happens in daily data science workflows. It also raised a question we’ve all been wrestling with: how do we actually make AI work for data analysis without compromising on privacy or breaking the bank on API calls?
If you’re a Data Scientist, you’ve likely felt the friction of jumping between a hypothesis in your head and the specific boilerplate code needed to test it. This is exactly where PandasAI sits. It’s an open-source library that effectively wraps a conversational AI layer around your dataframes. Instead of writing out a complex groupby and agg chain, you can just ask, “Which regions outperformed their sales targets last quarter?” and it translates that intent into executable Python code.
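For a sense of what that saves, here is roughly the pandas you would otherwise write for that question; the file and column names are just placeholders for illustration:
import pandas as pd
# Hand-written equivalent of "Which regions outperformed their sales targets
# last quarter?" -- file and column names below are assumed for illustration.
sales = pd.read_csv("quarterly_sales.csv")
by_region = sales.groupby("region", as_index=False).agg(
    actual=("sales", "sum"), target=("target", "sum")
)
outperformers = by_region.query("actual > target")
print(outperformers)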
The author of the blog, Viktor Vörös, walks through setting up PandasAI with a local model (specifically qwen2.5-coder), and honestly, it made me realize we’re at this weird inflection point. On one hand, cloud-based LLMs like GPT-4 are incredibly powerful. On the other, there’s something appealing about keeping your data local, especially when you’re dealing with sensitive customer information or proprietary business metrics.
This got me curious enough to dig deeper into PandasAI itself and how it might fit into a modern data science workflow. What I found was pretty interesting, so let me share what I’ve learned.
I asked NotebookLM to generate a podcast about this topic, based on the conversation I had with it; check it out here
What Exactly Is PandasAI?
Think of PandasAI as a conversational layer on top of your regular pandas DataFrames. According to the official documentation, it’s an open-source Python library that lets you query your data using natural language instead of writing explicit pandas code. But here’s the thing—it’s not just about being lazy with syntax. It’s about changing how you interact with data exploration itself.
When you ask PandasAI something like “Show me the correlation between customer age and purchase frequency,” it doesn’t just run a predefined query. The LLM interprets your intent, generates appropriate Python code, executes it, and returns the result. Sometimes that’s a DataFrame, sometimes it’s a visualization, sometimes it’s just a number. The library handles all of that context for you.
What makes this interesting from a data science perspective is that it can handle multiple DataFrames simultaneously, perform joins on the fly, and even clean data as part of the query process. It’s like having a junior analyst who’s really good at pandas sitting next to you, except they work at LLM speeds.
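In the 2.x releases, that multi-DataFrame mode is exposed through SmartDatalake, which takes a list of frames and lets the model work out the join itself. A minimal sketch, with the file names assumed:
import pandas as pd
from pandasai import SmartDatalake
from pandasai.llm import OpenAI
# Two related frames; PandasAI decides how to join them to answer the question.
customers = pd.read_csv("customers.csv")  # assumed file
orders = pd.read_csv("orders.csv")        # assumed file
lake = SmartDatalake([customers, orders], config={"llm": OpenAI(api_token="your_api_key_here")})
lake.chat("Which customer segment has the highest average order value?")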
Setting Up Your Environment
Let’s get practical. If you’re going to try this locally (which I’d recommend, at least initially), here’s what you need:
System Requirements:
- Python 3.8 through 3.11 (3.12 support is still being worked on)
- Decent RAM—if you’re running larger models locally, 16GB minimum, 32GB is better
- For GPU acceleration (highly recommended), an NVIDIA GPU with at least 8GB VRAM
Installation is straightforward:
pip install pandasai
But here’s where it gets interesting. You have choices about which LLM to use.
The Cloud Route
If you want to start simple:
import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI
# Every .chat() call below goes out to the OpenAI API
llm = OpenAI(api_token="your_api_key_here")
df = pd.read_csv("your_data.csv")
sdf = SmartDataframe(df, config={"llm": llm})
response = sdf.chat("What's the average value in column X?")
print(response)
This works, and it works well. OpenAI’s models are sophisticated enough to handle complex queries, understand context, and generate clean code. The problem? Every query is an API call. That adds up, especially if you’re iterating rapidly during exploratory data analysis.
The Local Route
This is where things get more interesting (and where Viktor’s blog really shines). To run everything locally, you need a model server. Ollama is probably the easiest option:
# Install Ollama from ollama.ai, then:
ollama pull qwen2.5-coder:32b
ollama serve
Then in your Python code:
from langchain_ollama import ChatOllama
from pandasai import SmartDataframe
import pandas as pd
# Talks to the locally running Ollama server (default: http://localhost:11434)
llm = ChatOllama(model="qwen2.5-coder:32b")
df = pd.read_csv("customer_data.csv")
sdf = SmartDataframe(df, config={"llm": llm})
# Now you can query without any external API calls
response = sdf.chat("Show me customers who churned in the last quarter")
Why Go Local? The Honest Trade-offs
Look, I’m not going to pretend local LLMs are universally better. They’re not. But they have some real advantages that matter in practice.
The Good:
Privacy actually means privacy. When you’re working with healthcare data, financial records, or anything covered by GDPR, keeping data local isn’t just nice—it’s often mandatory. No data leaves your machine, period.
Cost structure is different. Yes, you need better hardware. But once you have it, your marginal cost per query is essentially zero. If you’re running thousands of exploratory queries a day (which you probably are during feature engineering), this matters.
Offline capability. This sounds trivial until you’re on a flight trying to finish an analysis, or your internet goes down, or you’re working from somewhere with terrible connectivity.
Control over the model. Want to fine-tune the model on your specific domain? Go ahead. Want to adjust temperature, top-k, or other generation parameters? You can do that too.
The Not-So-Good:
Output quality isn’t quite there yet. Let me be honest: GPT-4 is still better at understanding complex queries and generating correct code. The gap is closing—models like qwen2.5-coder and deepseek-coder are surprisingly capable—but it exists.
Hardware requirements are real. You need a decent machine. The 32B parameter models Viktor mentions need about 20GB of RAM. The 7B models will run on less but sacrifice some capability.
Setup complexity. There’s a learning curve. You need to understand model formats, quantization, context windows, and how to troubleshoot when things go wrong.
Slower response times. Even with GPU acceleration, local models are generally slower than hitting an API endpoint with massive infrastructure behind it.
How This Fits Into a Data Scientist’s Workflow
Here’s where PandasAI starts to make sense beyond just being a cool demo. Think about the typical data science process:
You get a new dataset. Maybe it’s customer transaction logs, maybe it’s sensor data from IoT devices, maybe it’s survey responses. Your first job is to understand it. That means exploring distributions, checking for missing values, understanding relationships between variables, and generating initial hypotheses.
Traditionally, this means writing a lot of boilerplate pandas code. Not hard code, necessarily, but repetitive:
df.describe()
df.info()
df.isnull().sum()
df['column_name'].value_counts()
# etc.
With PandasAI, you can just ask:
sdf.chat("Give me a comprehensive summary of this dataset including missing values, distributions, and potential data quality issues")
More importantly, you can ask follow-up questions without rewriting code:
sdf.chat("What's the correlation between purchase_amount and customer_age?")
sdf.chat("Show me the same correlation but only for customers in California")
sdf.chat("Now create a scatter plot of that relationship")
This iterative, conversational approach mirrors how you actually think about data. You don’t plan out all your analysis steps in advance—you follow the data where it leads you.
Practical Usage Patterns
Let me share some patterns I’ve found useful:
Temperature matters. When using local models, set the temperature to 0 or very close to it. You want deterministic code generation, not creative writing. Viktor mentions this in his blog and he’s absolutely right—it reduces the weird edge cases where the model gets creative with your queries.
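With the Ollama setup from earlier, that is a one-line change:
from langchain_ollama import ChatOllama
# Temperature 0 keeps code generation deterministic instead of "creative"
llm = ChatOllama(model="qwen2.5-coder:32b", temperature=0)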
Start broad, then narrow. Ask general questions first: “Summarize this data.” Then drill down: “Show me outliers in the revenue column.” Then get specific: “Create a boxplot showing revenue distribution by region, excluding outliers.”
Use it for data cleaning. PandasAI is actually quite good at identifying and fixing data quality issues:
sdf.chat("Find and show me all rows where the date format is inconsistent")
sdf.chat("Standardize the phone number column to (XXX) XXX-XXXX format")
Leverage the visualization capabilities. Rather than remembering matplotlib or seaborn syntax:
sdf.chat("Create a heatmap showing correlations between all numeric columns")
Integrating with Your Data Stack
Now, here’s where things get really interesting for production data science workflows. PandasAI doesn’t exist in a vacuum—you’re probably using other tools, and you want them to work together.
DuckDB Integration
DuckDB is increasingly becoming the go-to for analytical workloads. It’s fast, it handles larger-than-memory datasets gracefully, and it plays nice with pandas. Combining it with PandasAI gives you the best of both worlds.
import duckdb
from pandasai import SmartDataframe
from langchain_ollama import ChatOllama
# Connect to DuckDB
con = duckdb.connect('analytics.ddb')
# Query with DuckDB for efficiency
query = """
SELECT
customer_id,
SUM(purchase_amount) as total_spent,
COUNT(*) as transaction_count,
AVG(purchase_amount) as avg_purchase
FROM transactions
WHERE purchase_date >= '2024-01-01'
GROUP BY customer_id
"""
df = con.execute(query).df()
# Now use PandasAI for exploratory analysis
llm = ChatOllama(model="qwen2.5-coder:32b")
sdf = SmartDataframe(df, config={"llm": llm})
# Ask follow-up questions without writing more SQL
sdf.chat("Which customers spent more than $10,000 but only made 2 transactions?")
sdf.chat("Show me the distribution of average purchase amounts")
Why does this matter? DuckDB handles the heavy lifting—filtering millions of rows, performing aggregations, maybe even joining multiple tables. PandasAI handles the interactive, iterative exploration on the already-reduced result set. You’re not asking the LLM to generate SQL or process huge datasets; you’re using it for what it’s good at: flexible, conversational analysis of manageable data.
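The same pattern holds when the raw data never fits in pandas at all: DuckDB can scan files directly and hand back only the reduced result. A sketch, assuming the transactions live in Parquet files:
import duckdb
# Scan Parquet files that may be larger than memory and return only the
# aggregated result as a pandas DataFrame for PandasAI to explore.
# The path/glob is an assumption about how your data is laid out.
reduced = duckdb.sql("""
    SELECT region, product_category, SUM(purchase_amount) AS revenue
    FROM read_parquet('data/transactions/*.parquet')
    GROUP BY region, product_category
""").df()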
Apache Superset: Bridging to BI
Apache Superset is interesting because it represents the handoff from data science to business intelligence. You’ve done your analysis, built your models, generated insights—now you need to put that in front of stakeholders who aren’t comfortable with code.
The integration pattern here is less about direct technical connection and more about workflow:
import pandas as pd
from pandasai import SmartDataframe
from langchain_ollama import ChatOllama
# Use PandasAI for exploratory analysis
llm = ChatOllama(model="qwen2.5-coder:32b")
raw_data = pd.read_csv("campaign_data.csv")
sdf = SmartDataframe(raw_data, config={"llm": llm})
# Explore and identify interesting patterns
sdf.chat("What factors correlate most strongly with conversion rate?")
sdf.chat("Are there any unexpected patterns in the data?")
# Once you've identified key metrics and dimensions,
# create curated views for Superset
summary_df = sdf.chat("Create a summary table showing weekly conversion rates by channel and region")
# Save the processed data to the database Superset reads from
# (the connection string below is a placeholder -- point it at whatever backs your Superset instance)
from sqlalchemy import create_engine
engine = create_engine("postgresql://user:password@host:5432/analytics")
summary_df.to_sql('campaign_summary', con=engine, if_exists='replace', index=False)
# Now this curated dataset is available in Superset for dashboard creation
What you’re doing here is using PandasAI’s flexibility during the exploratory phase to identify what actually matters. Then you’re creating clean, documented datasets that less technical users can visualize in Superset.
I’ve found that this two-phase approach works really well. Data scientists use PandasAI (with local LLMs for privacy) to rapidly iterate through hypotheses. Once they’ve found something worth sharing, they create production-ready views that get exposed through Superset dashboards. Business users get polished, interactive visualizations. Data scientists maintain control over the underlying analysis logic.
The Real-World Workflow
Let me walk through how this might actually look in practice. Say you’re analyzing customer churn:
Phase 1: Initial Exploration (PySpark + PandasAI)
# Process millions of customer records with Spark
# (assumes an active SparkSession named `spark` and a `customer_attributes` DataFrame loaded elsewhere)
from pyspark.sql.functions import col

churned_customers = spark.read.table("customer_events") \
    .filter(col("event_type") == "churn") \
    .join(customer_attributes, "customer_id") \
    .toPandas()
# Explore with PandasAI
sdf = SmartDataframe(churned_customers, config={"llm": llm})
sdf.chat("What are the common characteristics of customers who churned?")
sdf.chat("Show me churn rate trends over time")
Phase 2: Deeper Analysis (DuckDB + PandasAI)
# Store intermediate results in DuckDB for efficient querying
# (DuckDB can see the in-scope pandas DataFrame `churned_customers` by name)
con = duckdb.connect('churn_analysis.ddb')
con.execute("CREATE TABLE churned AS SELECT * FROM churned_customers")
# Complex analytical queries with DuckDB
high_value_churned = con.execute("""
SELECT * FROM churned
WHERE lifetime_value > 10000
AND tenure_months > 12
""").df()
# Conversational analysis with PandasAI
sdf = SmartDataframe(high_value_churned, config={"llm": llm})
sdf.chat("What's different about these high-value churners?")
Phase 3: Reporting (Apache Superset)
# Create summary tables for Superset
churn_metrics = con.execute("""
SELECT
DATE_TRUNC('month', churn_date) as month,
customer_segment,
COUNT(*) as churned_count,
AVG(lifetime_value) as avg_ltv_lost
FROM churned
GROUP BY month, customer_segment
""").df()
# Push to the database Superset reads from
# (superset_engine is a SQLAlchemy engine for that database, created the same way as in the Superset section above)
churn_metrics.to_sql('churn_dashboard_metrics',
                     con=superset_engine,
                     if_exists='replace')
The Limitations You Should Know About
I want to be upfront about where this approach falls short, because it’s not a magic bullet.
Context window constraints are real. Even with clever chunking strategies (which PandasAI does implement), you can’t just throw a 50GB CSV file at a local LLM and expect good results. Viktor’s blog mentions this—PandasAI uses a “snippet system” to work around it, but there are limits.
Complex operations can be hit-or-miss. Local models sometimes struggle with multi-step operations or unusual aggregations. GPT-4 might nail “calculate the 90th percentile of rolling 7-day average revenue, grouped by region and product category” on the first try. qwen2.5-coder might need you to break that into steps (see the plain-pandas sketch at the end of this section).
Error handling requires attention. The generated code isn’t always perfect. You need to check results, especially for business-critical analyses. This isn’t unique to local models—even GPT-4 makes mistakes—but the error rate is higher.
Model selection matters a lot. Not all local models are created equal for this use case. Code-focused models (like qwen2.5-coder or deepseek-coder) work better than general-purpose models. Bigger isn’t always better either—a well-tuned 7B model can outperform a poorly-configured 32B model.
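Coming back to the “break it into steps” point above, here is what those steps look like in plain pandas; it also doubles as a reference when you want to double-check what the model produced. The DataFrame and its columns (date, region, product_category, revenue) are assumed for illustration:
# Hand-written version of "90th percentile of rolling 7-day average revenue,
# grouped by region and product category" -- df and its columns are assumed.
daily = (df.groupby(["region", "product_category", "date"], as_index=False)["revenue"]
           .sum()
           .sort_values("date"))
daily["rolling_7d"] = (daily.groupby(["region", "product_category"])["revenue"]
                            .transform(lambda s: s.rolling(7, min_periods=1).mean()))
p90 = daily.groupby(["region", "product_category"])["rolling_7d"].quantile(0.9)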
Some Practical Tips from Experience
1. Be specific in your queries. “Analyze this data” is too vague. “Calculate mean, median, and standard deviation for the revenue column, then show me a histogram” works better.
2. Iterate in small steps. Rather than asking for a complex multi-stage analysis all at once, build it up:
- “Show me summary statistics”
- “Now break that down by category”
- “Create a bar chart of the results”
3. Set up proper error logging. When PandasAI generates code that fails, you want to see why:
import logging
logging.basicConfig(level=logging.DEBUG)
sdf = SmartDataframe(df, config={
"llm": llm,
"verbose": True,
"enable_cache": False # during development
})
4. Use the conversational mode strategically. PandasAI can maintain context across queries, which is great for follow-ups but can sometimes cause confusion if the conversation drifts. Start fresh when you switch to a different analysis thread.
5. Validate critical results. Especially with local models, double-check important findings. Ask the same question different ways. Look at the generated code. Make sure it’s doing what you think it’s doing.
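A minimal version of that last check, with a couple of column names assumed for illustration:
# Cross-check a conversational answer against a direct pandas computation
# before it goes into anything business-critical. Column names are assumed.
llm_answer = sdf.chat("What is the average purchase_amount for customers in California?")
direct = df.loc[df["state"] == "California", "purchase_amount"].mean()
# sdf.chat may return a number, a string, or a DataFrame depending on the query,
# so coerce before comparing.
assert abs(float(llm_answer) - direct) < 1e-6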
Where This Is All Heading
The convergence of local LLMs and data analysis tools feels like it’s reaching a tipping point. A couple of years ago, running a capable language model locally was basically impossible for most people. Now you can run surprisingly good models on a decent workstation.
What excites me is the possibility of hybrid approaches. Imagine:
- Using local models for rapid iteration during exploration (fast, private, cheap)
- Calling out to GPT-4 for complex operations that need higher accuracy
- Automatically falling back to cloud models when local inference is too slow
- Fine-tuning local models on your organization’s specific data patterns
PandasAI already supports multiple LLM backends, so the infrastructure for this exists. It’s just a matter of building the right orchestration logic.
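A first pass at that orchestration logic can be as simple as a try-local-then-cloud wrapper. This is a sketch, assuming both backends are configured as in the earlier examples:
from pandasai import SmartDataframe
from pandasai.llm import OpenAI
from langchain_ollama import ChatOllama
# Naive routing: try the local model first, fall back to OpenAI if the local
# attempt raises or comes back empty.
local_llm = ChatOllama(model="qwen2.5-coder:32b", temperature=0)
cloud_llm = OpenAI(api_token="your_api_key_here")
def ask(df, question):
    try:
        answer = SmartDataframe(df, config={"llm": local_llm}).chat(question)
        if answer is not None:
            return answer
    except Exception:
        pass  # fall through to the cloud model
    return SmartDataframe(df, config={"llm": cloud_llm}).chat(question)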
Should You Actually Use This?
Honestly? It depends on your situation.
You should definitely try it if:
- You handle sensitive data and can’t use cloud APIs
- You do a lot of exploratory analysis and would benefit from faster iteration
- You’re comfortable with some additional setup complexity
- You have decent hardware or are willing to invest in it
You might want to stick with cloud-based solutions if:
- You need absolute maximum accuracy on complex queries
- Your data isn’t particularly sensitive
- You’re working on a laptop without much RAM or GPU
- You value simplicity over cost optimization
A hybrid approach makes sense if:
- You have a mix of sensitive and non-sensitive data
- You want to optimize costs but can’t compromise on capability
- You’re willing to invest time in setting up proper orchestration
Final Thoughts
What Digital Archer’s blog post really highlights—and what I’ve come to appreciate while digging into this—is that we’re at this interesting moment where local AI is actually viable for real work. It’s not just a privacy-obsessed curiosity anymore.
The combination of PandasAI with tools like DuckDB, PySpark, and Superset creates a surprisingly complete data analysis stack. You’re not choosing between local and cloud, between conversational and programmatic, between exploration and production. You can have all of it, integrated in a way that makes sense for your specific workflow.
Is it perfect? No. Are there rough edges? Absolutely. But the trajectory is clear. Local models are getting better fast. The tooling around them is maturing. And the integration with existing data infrastructure is increasingly seamless.
If you’re a data scientist who hasn’t experimented with this approach yet, I’d encourage you to spend an afternoon setting it up. Pull down Ollama, install PandasAI, load up a dataset you know well, and just play around. See what works, see what doesn’t, and see if it changes how you think about data exploration.
Because ultimately, that’s what this is all about: giving us better tools to ask questions of our data, to follow hunches, to iterate quickly, and to find insights we might have missed if we were too worried about API costs or privacy concerns to just explore freely.
And sometimes, freely exploring is where the best insights come from.
Note: All code examples in this post are illustrative. Your actual implementation will depend on your specific environment, data formats, and requirements. Always validate outputs, especially when using generated code in production contexts.
