Data Analysis Assistant: LLM-Powered Data Exploration
Welcome to the Data Analysis Assistant tutorial! This guide will help you leverage CellMage's capabilities to enhance your data analysis workflows, generate insights, and create visualizations through natural language interactions.
Note: You can control where CellMage stores all working files (snippets, personas, logs, etc.) by setting the CELLMAGE_BASE_DIR environment variable. See the README for details.
What You'll Learn
In this tutorial, you'll discover:
How to use CellMage as an intelligent data analysis companion
Techniques for data exploration and visualization through natural language
Strategies for generating and refining analysis code
Methods for explaining complex statistical concepts
Best practices for data-centric LLM workflows
Prerequisites
Before diving in, make sure:
You have basic knowledge of Python data science libraries (pandas, matplotlib, etc.)
You have a Jupyter notebook environment set up
You have CellMage installed and loaded:
%load_ext cellmage
You have common data science packages installed:
pip install pandas matplotlib seaborn scikit-learn
Step 1: Preparing Your Data Analysis Environment
Let's start by setting up an environment with some sample data:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set plot styling
plt.style.use('ggplot')
%matplotlib inline
# Load sample dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Display the first few rows
df.head()
Step 2: Basic Data Exploration with LLM
Use CellMage to explore your dataset through natural language:
# Add your dataframe to the context
%llm_config --snippet df.head().to_string()
%llm_config --snippet df.describe().to_string()
%llm_config --snippet df.info()
# Ask questions about your data
%%llm
I'm working with a dataset called 'df'. Based on what you can see:
1. What kind of data is this?
2. What are the main variables?
3. What initial insights can you provide?
4. What analysis would you recommend?
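One way to prepare the context shared above is to build a single text summary in Python first. The sketch below is an assumption about workflow rather than part of CellMage: `summarize_dataframe` is a hypothetical helper, and the small inline frame only stands in for the tips data. Note that pandas' `df.info()` prints its report and returns `None`, so its text has to be captured through the `buf` argument.

```python
import io

import pandas as pd


def summarize_dataframe(df: pd.DataFrame, n_rows: int = 5) -> str:
    """Return the first rows, structure, and descriptive stats as one text block."""
    # df.info() writes its report into the buffer instead of returning a string
    info_buffer = io.StringIO()
    df.info(buf=info_buffer)

    parts = [
        "First rows:",
        df.head(n_rows).to_string(),
        "",
        "Structure:",
        info_buffer.getvalue(),
        "Descriptive statistics:",
        df.describe().to_string(),
    ]
    return "\n".join(parts)


# Hypothetical stand-in for the tips dataset (same column names, made-up rows)
sample = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.50, 3.31],
        "time": ["Dinner", "Dinner", "Lunch", "Lunch"],
    }
)

summary = summarize_dataframe(sample)
print(summary)
```

The resulting string can then be pasted or saved wherever your setup expects context text.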
Step 3: Generating Data Visualization Code
Ask CellMage to help you create visualizations:
%%llm
Generate Python code to create these visualizations for my tips dataset:
1. A histogram of tip amounts
2. A scatter plot of bill amount vs. tip amount, colored by time of day
3. A box plot showing tip percentage across different days of the week
4. A heatmap of correlations between numerical variables
Make sure the code is well-commented and uses seaborn for better aesthetics.
Now execute the generated code in a new cell to see your visualizations.
Step 4: Targeted Data Analysis
Ask for specific analyses based on your preliminary findings:
%%llm
Based on the tips dataset, generate code to analyze:
1. Whether there's a statistically significant difference in tip percentages between lunch and dinner
2. If the day of the week affects tipping behavior
3. The relationship between party size and average tip percentage
Include proper statistical tests and clear explanations of the results.
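As a reference point for checking what the LLM returns, the first question (lunch vs. dinner) could be tested roughly as follows. This is a minimal sketch under stated assumptions: the rows are made up, and Welch's t-test is one reasonable choice among several, picked here because it does not assume equal variances.

```python
import pandas as pd
from scipy import stats

# Made-up rows that reuse the tips dataset's column names
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
        "time": ["Dinner"] * 4 + ["Lunch"] * 4,
    }
)
tips["tip_pct"] = tips["tip"] / tips["total_bill"] * 100

lunch = tips.loc[tips["time"] == "Lunch", "tip_pct"]
dinner = tips.loc[tips["time"] == "Dinner", "tip_pct"]

# Welch's t-test: compares the two means without assuming equal variances
result = stats.ttest_ind(lunch, dinner, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

With only a handful of rows the test has little power; on the full dataset the same calls apply unchanged.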
Step 5: Statistical Concept Explanations
When you encounter unfamiliar statistical concepts:
%%llm
Please explain the following concepts in the context of our tips dataset analysis:
1. What is a t-test and when should I use it?
2. What does p-value mean in our analysis?
3. What are the assumptions of linear regression?
4. How should I interpret the correlation coefficient?
Explain with simple examples using our restaurant tipping context.
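It can help to compute a concept yourself alongside the LLM's explanation. The sketch below works out the Pearson correlation coefficient step by step on a few made-up bill and tip values, then compares it with NumPy's built-in result.

```python
import numpy as np

# Made-up bill and tip amounts for illustration
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61, 4.71])

# Pearson r: sum of co-deviations divided by the product of the deviation norms
bill_dev = total_bill - total_bill.mean()
tip_dev = tip - tip.mean()
r_manual = (bill_dev * tip_dev).sum() / np.sqrt(
    (bill_dev**2).sum() * (tip_dev**2).sum()
)

# The same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(total_bill, tip)[0, 1]
print(f"r (manual) = {r_manual:.4f}, r (numpy) = {r_numpy:.4f}")
```

A value near +1 means tips rise almost linearly with the bill; a value near 0 means no linear relationship.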
Step 6: Data Cleaning and Preparation
Get help with data cleaning and transformation:
%%llm
Generate code to clean and prepare the tips dataset:
1. Create a new 'tip_percentage' column (tip as a percentage of total bill)
2. Convert categorical variables to appropriate formats for modeling
3. Check for and handle any potential outliers
4. Split the data into training and testing sets for a model that predicts tip percentage
Include explanations for each preprocessing step.
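The four steps in this prompt might translate into something like the sketch below. The rows are made up, the 1.5 × IQR rule is just one common outlier heuristic, and dropping `tip` from the features is a design choice made here to avoid leaking the target into the model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows that reuse the tips dataset's column names
tips = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29, 8.77, 26.88, 15.04, 14.78],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12, 1.96, 3.23],
        "time": ["Dinner"] * 5 + ["Lunch"] * 5,
        "size": [2, 3, 3, 2, 4, 4, 2, 4, 2, 2],
    }
)

# 1. Tip as a percentage of the total bill
tips["tip_percentage"] = tips["tip"] / tips["total_bill"] * 100

# 2. One-hot encode the categorical column (first category dropped)
encoded = pd.get_dummies(tips, columns=["time"], drop_first=True)

# 3. Flag (not drop) values outside 1.5 * IQR
q1, q3 = encoded["tip_percentage"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = encoded["tip_percentage"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
encoded["is_outlier"] = ~in_range

# 4. Split into features and target; "tip" is excluded because the
#    target is derived from it
X = encoded.drop(columns=["tip", "tip_percentage", "is_outlier"])
y = encoded["tip_percentage"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```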
Step 7: Model Building Assistance
Ask for help creating predictive models:
%%llm
Write code to build three different models to predict tip percentage:
1. A simple linear regression model
2. A random forest regressor
3. A gradient boosting model
Include feature selection, model training, evaluation metrics, and a comparison of the models' performance.
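A comparison of the three model families could be structured as in the sketch below. It runs on synthetic data (an assumption made so the example is self-contained) and uses 5-fold cross-validated R² as the single metric; generated code for the real dataset would likely add feature selection and further metrics.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, noisy linear target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean 5-fold cross-validated R^2 for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

# Print from best to worst
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: R^2 = {score:.3f}")
```

Cross-validation is used instead of a single train/test split so that the ranking depends less on one lucky partition.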
Step 8: Results Interpretation
After running analyses, ask the LLM to help interpret results:
# First, run your analysis code and capture the output
# Then paste relevant results as context
%llm_config --snippet """
Model Results:
Linear Regression R² = 0.45
Random Forest R² = 0.67
Gradient Boosting R² = 0.72
Feature Importances:
- party_size: 0.35
- total_bill: 0.25
- day (Saturday): 0.15
- time (Dinner): 0.12
- smoker (Yes): 0.08
- day (Sunday): 0.05
"""
%%llm
Interpret these model results for predicting tip percentage:
1. Which model performed best and why?
2. What are the most important factors influencing tip amounts?
3. How would you explain these results to a restaurant manager?
4. What recommendations would you make based on this analysis?
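Rather than typing results by hand, you could format them from Python objects. The helper below is hypothetical (`format_results` is not a CellMage function) and simply renders two dictionaries as plain text, using a few of the example values shown earlier in this step.

```python
def format_results(r2_scores: dict, importances: dict) -> str:
    """Render metrics as plain text, each group sorted from highest to lowest."""
    lines = ["Model Results:"]
    for name, score in sorted(r2_scores.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{name} R^2 = {score:.2f}")
    lines.append("Feature Importances:")
    for name, weight in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"- {name}: {weight:.2f}")
    return "\n".join(lines)


# Example values taken from this step's sample results
r2_scores = {"Linear Regression": 0.45, "Random Forest": 0.67, "Gradient Boosting": 0.72}
importances = {"party_size": 0.35, "total_bill": 0.25, "day (Saturday)": 0.15}

report = format_results(r2_scores, importances)
print(report)
```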
Step 9: Generating Analysis Reports
Use CellMage to help create professional reports:
%%llm
Based on our analysis of the restaurant tipping dataset, generate a structured report with these sections:
1. Executive Summary (2-3 paragraphs)
2. Methodology (brief description of data and approaches)
3. Key Findings (bullet points of 4-5 main discoveries)
4. Visualizations to Include (list the most important ones)
5. Business Recommendations (3-4 actionable insights)
6. Future Analysis Suggestions
Write this for a restaurant management audience.
Step 10: Refining Analysis Iteratively
Use the iterative nature of CellMage to refine your analysis:
%%llm
The scatter plot showing tip amount vs total bill revealed a potential non-linear relationship.
Suggest three different transformations we could apply, and generate code to:
1. Apply these transformations
2. Visualize the transformed relationships
3. Determine which transformation makes the relationship most linear
4. Rebuild our regression model with the best transformation
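To judge the suggestions you get back, one simple yardstick is the absolute Pearson correlation after each transformation. The sketch below applies it to synthetic data where the true relationship is logarithmic; the choice of candidate transformations and the scoring rule are assumptions made for illustration.

```python
import numpy as np

# Synthetic example: y depends on log(x), so the raw relationship is curved
x = np.linspace(1.0, 50.0, 100)
y = 2.0 * np.log(x) + 1.0

# Candidate transformations of x, with the untransformed case as a baseline
transformations = {
    "identity": lambda v: v,
    "log": np.log,
    "sqrt": np.sqrt,
    "reciprocal": lambda v: 1.0 / v,
}

# Absolute Pearson correlation between transformed x and y as a linearity score
linearity = {
    name: abs(np.corrcoef(func(x), y)[0, 1])
    for name, func in transformations.items()
}
best = max(linearity, key=linearity.get)
print(best, round(linearity[best], 4))
```

On real, noisy data the scores will be closer together, so residual plots remain worth checking alongside this number.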
Advanced Data Analysis Applications
Time Series Analysis
# First load a time series dataset
ts_data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv')
ts_data['Month'] = pd.to_datetime(ts_data['Month'])
ts_data.set_index('Month', inplace=True)
ts_data.head()
%%llm
Generate comprehensive time series analysis code for this airline passenger dataset:
1. Decompose the series into trend, seasonal, and residual components
2. Test for stationarity and apply transformations if needed
3. Build and evaluate a forecasting model (ARIMA or Prophet)
4. Generate and visualize a 12-month forecast with confidence intervals
5. Explain key insights about seasonality and trends
Include well-commented code and explanations.
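To make the first item concrete, here is a hand-rolled additive decomposition using only pandas. It is a simplified sketch on a synthetic monthly series, not a substitute for a dedicated library routine: the 12-month rolling mean is only approximately centered, and the seasonal component is a plain per-month average.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a 12-month sine pattern
index = pd.date_range("2000-01-01", periods=48, freq="MS")
months = np.arange(48)
observed = pd.Series(
    100 + 2.0 * months + 10 * np.sin(2 * np.pi * months / 12), index=index
)

# Trend: 12-month rolling mean (NaN near both ends of the series)
trend = observed.rolling(window=12, center=True).mean()

# Seasonal: average detrended value for each calendar month
detrended = observed - trend
seasonal = detrended.groupby(detrended.index.month).transform("mean")

# Residual: what trend and seasonality together do not explain
residual = observed - trend - seasonal

print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "residual": residual}).dropna().head())
```

The three components add back up to the observed series wherever the trend is defined.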
Exploratory Data Analysis (EDA) Pipeline
%%llm
Create a reusable EDA function that I can apply to any dataset. The function should:
1. Provide basic statistics (missing values, duplicates, descriptive stats)
2. Generate appropriate visualizations based on data types
3. Identify potential outliers and unusual patterns
4. Examine correlations and relationships between variables
5. Suggest potential feature engineering steps
6. Output a summary of findings
Make it modular and well-commented so I can adapt it for future datasets.
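A starting point for such a function might look like the sketch below. `quick_eda` is a hypothetical name, the outlier rule (1.5 × IQR) is an assumption, and visualizations and feature-engineering suggestions are left out to keep it short.

```python
import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Collect basic data-quality and summary information for any DataFrame."""
    numeric = df.select_dtypes(include="number")

    # Count values outside 1.5 * IQR for each numeric column
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)

    return {
        "shape": df.shape,
        "missing_values": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "descriptive_stats": numeric.describe(),
        "correlations": numeric.corr(),
        "outlier_counts": outside.sum().to_dict(),
    }


# Made-up data with one missing tip, one duplicated row, and one extreme bill
demo = pd.DataFrame(
    {
        "total_bill": [11.0, 12.0, 11.0, 13.0, 12.0, 95.0, 12.0],
        "tip": [1.5, 2.0, None, 2.2, 2.0, 9.0, 2.0],
        "day": ["Sat", "Sun", "Sat", "Sun", "Sat", "Sun", "Sat"],
    }
)
report = quick_eda(demo)
print(report["missing_values"], report["duplicate_rows"], report["outlier_counts"])
```

Returning a dictionary rather than printing keeps the function easy to extend and to feed into later steps.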
Interactive Dashboard Code Generator
%%llm
Generate code for a simple Plotly Dash application that creates an interactive dashboard for the tips dataset with:
1. A dropdown to select variables for the x and y axes
2. Filter options for categorical variables (day, time, smoker)
3. Radio buttons to switch between different chart types
4. A statistics panel showing key metrics about the selected data
5. A data table showing the filtered records
Include all necessary code to run the application locally.
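Whatever dashboard code comes back, the filtering and statistics logic behind it can be written and tested separately from the UI. The sketch below covers only that data layer with plain pandas; the function names are hypothetical, the rows are made up, and no Dash code is shown.

```python
import pandas as pd


def filter_tips(df, day=None, time=None, smoker=None):
    """Apply optional categorical filters; None means 'no filter'."""
    mask = pd.Series(True, index=df.index)
    for column, value in {"day": day, "time": time, "smoker": smoker}.items():
        if value is not None:
            mask &= (df[column] == value)
    return df[mask]


def summary_panel(df):
    """Key metrics for a statistics panel."""
    return {
        "rows": len(df),
        "mean_bill": round(df["total_bill"].mean(), 2),
        "mean_tip": round(df["tip"].mean(), 2),
    }


# Made-up rows that reuse the tips dataset's column names
tips = pd.DataFrame(
    {
        "total_bill": [20.0, 10.0, 30.0, 40.0],
        "tip": [3.0, 1.0, 5.0, 6.0],
        "day": ["Sat", "Sat", "Sun", "Sat"],
        "time": ["Dinner", "Lunch", "Dinner", "Dinner"],
        "smoker": ["No", "Yes", "No", "Yes"],
    }
)

selected = filter_tips(tips, day="Sat", time="Dinner")
panel = summary_panel(selected)
print(panel)
```

A dashboard callback would then only need to pass the current dropdown values into `filter_tips` and display the result.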
Limitations and Best Practices
Working with data and LLMs has some important considerations:
Data privacy - Be careful not to share sensitive data with LLMs
Verification is essential - Always review and validate generated code and conclusions
Context limitations - Large datasets may exceed token limits; share summaries instead
Statistical accuracy - Double-check statistical claims and methodologies
Code execution - Test generated code on small data subsets before running on large datasets
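The privacy and context-size points above can be combined into a small preparation step before anything is shared. The sketch below is illustrative only: `prepare_for_sharing` is a hypothetical helper, and `customer_name` is an invented sensitive column.

```python
import pandas as pd


def prepare_for_sharing(df, sensitive_columns, n_rows=5, seed=0):
    """Drop sensitive columns and return a small, reproducible sample."""
    safe = df.drop(columns=list(sensitive_columns))
    return safe.sample(n=min(n_rows, len(safe)), random_state=seed)


# Made-up records; "customer_name" is a hypothetical sensitive column
records = pd.DataFrame(
    {
        "customer_name": ["Ana", "Ben", "Cy", "Dee", "Eli", "Fay"],
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    }
)

shareable = prepare_for_sharing(records, ["customer_name"], n_rows=3)
print(shareable.to_string())
```

The original frame is left untouched, so the full data stays available for local analysis.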
Data Analysis Best Practices
Start broad, then narrow - Begin with exploratory questions before diving into specifics
Provide clear context - Share data types, sizes, and descriptive statistics
Ask for explanations - Request comments in generated code to improve understanding
Iterate on results - Use outputs to inform your next set of questions
Combine expertise - Use LLM suggestions alongside your domain knowledge
What's Next?
Now that you've learned to use CellMage for data analysis:
Explore Code Generation Workflow for more complex data projects
Try Document Summarization for analyzing text data
Learn about Chain of Thought techniques for complex analytical reasoning
May your data reveal its secrets through magical assistance! ✨