📊 Data Analysis Assistant: LLM-Powered Data Exploration

Welcome to the Data Analysis Assistant tutorial! This guide will help you leverage CellMage's capabilities to enhance your data analysis workflows, generate insights, and create visualizations through natural language interactions.

Note: You can control where CellMage stores all working files (snippets, personas, logs, etc.) by setting the CELLMAGE_BASE_DIR environment variable. See the README for details.

🎯 What You'll Learn

In this tutorial, you'll discover:

  • How to use CellMage as an intelligent data analysis companion

  • Techniques for data exploration and visualization through natural language

  • Strategies for generating and refining analysis code

  • Methods for explaining complex statistical concepts

  • Best practices for data-centric LLM workflows

๐Ÿง™โ€โ™‚๏ธ Prerequisitesยถ

Before diving in, make sure:

  • You have basic knowledge of Python data science libraries (pandas, matplotlib, etc.)

  • You have a Jupyter notebook environment set up

  • You have CellMage installed and loaded:

    %load_ext cellmage
    
  • You have common data science packages installed:

    pip install pandas matplotlib seaborn scikit-learn
    

๐Ÿ“ Step 1: Preparing Your Data Analysis Environmentยถ

Letโ€™s start by setting up an environment with some sample data:

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set plot styling
plt.style.use('ggplot')
%matplotlib inline

# Load sample dataset
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# Display the first few rows
df.head()

🧬 Step 2: Basic Data Exploration with LLM

Use CellMage to explore your dataset through natural language:

# Add your dataframe to the context
%llm_config --snippet df.head().to_string()
%llm_config --snippet df.describe().to_string()
%llm_config --snippet df.info()

# Ask questions about your data (run this in its own cell: %%llm must be the first line of a cell)
%%llm
I'm working with a dataset called 'df'. Based on what you can see:
1. What kind of data is this?
2. What are the main variables?
3. What initial insights can you provide?
4. What analysis would you recommend?
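One caveat about the context step: `df.info()` prints its report and returns `None`, so it has no text to pass along unless the output is captured through its `buf` parameter. The sketch below shows one way to collect the head, the descriptive statistics, and the structure report in a single string. The helper name `summarize_frame` and the small hand-made `sample` frame are illustrative assumptions; how you hand the resulting text to CellMage (for example as a saved snippet file) depends on your setup.

```python
import io

import pandas as pd


def summarize_frame(frame: pd.DataFrame, n_rows: int = 5) -> str:
    """Return head(), describe() and info() output as one plain-text block."""
    info_buffer = io.StringIO()
    # info() writes into the buffer instead of printing to stdout
    frame.info(buf=info_buffer)
    parts = [
        "First rows:\n" + frame.head(n_rows).to_string(),
        "Summary statistics:\n" + frame.describe(include="all").to_string(),
        "Structure:\n" + info_buffer.getvalue(),
    ]
    return "\n\n".join(parts)


# Hand-made stand-in with the same columns as the tips dataset
sample = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68],
        "tip": [1.01, 1.66, 3.50, 3.31],
        "sex": ["Female", "Male", "Male", "Male"],
        "smoker": ["No", "No", "No", "No"],
        "day": ["Sun", "Sun", "Sun", "Sun"],
        "time": ["Dinner", "Dinner", "Dinner", "Dinner"],
        "size": [2, 3, 3, 2],
    }
)

summary_text = summarize_frame(sample)
print(summary_text)
```

In the notebook you would call `summarize_frame(df)` on the real dataframe instead of `sample`.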

📈 Step 3: Generating Data Visualization Code

Ask CellMage to help you create visualizations:

%%llm
Generate Python code to create these visualizations for my tips dataset:
1. A histogram of tip amounts
2. A scatter plot of bill amount vs. tip amount, colored by time of day
3. A box plot showing tip percentage across different days of the week
4. A heatmap of correlations between numerical variables

Make sure the code is well-commented and uses seaborn for better aesthetics.

Now execute the generated code in a new cell to see your visualizations.

๐Ÿ” Step 4: Targeted Data Analysisยถ

Ask for specific analyses based on your preliminary findings:

%%llm
Based on the tips dataset, generate code to analyze:
1. Whether there's a statistically significant difference in tip percentages between lunch and dinner
2. If the day of the week affects tipping behavior
3. The relationship between party size and average tip percentage

Include proper statistical tests and clear explanations of the results.
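When reviewing the generated analysis, it helps to know what reasonable tests for these three questions look like. The sketch below is one possible choice, not the only one: Welch's t-test for lunch versus dinner, a one-way ANOVA across days, and a Spearman rank correlation for party size. It runs on synthetic data with an assumed `tip_pct` column, so the p-values it prints carry no meaning about real tipping behavior.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in: tip_pct is drawn independently of every grouping column
rng = np.random.default_rng(1)
n = 200
sample = pd.DataFrame(
    {
        "time": rng.choice(["Lunch", "Dinner"], n),
        "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
        "size": rng.integers(1, 7, n),
        "tip_pct": rng.normal(16, 4, n),
    }
)

# 1. Lunch vs. dinner: Welch's t-test (does not assume equal variances)
lunch = sample.loc[sample["time"] == "Lunch", "tip_pct"]
dinner = sample.loc[sample["time"] == "Dinner", "tip_pct"]
t_stat, p_time = stats.ttest_ind(lunch, dinner, equal_var=False)

# 2. Day of the week: one-way ANOVA across the day groups
groups = [group["tip_pct"].to_numpy() for _, group in sample.groupby("day")]
f_stat, p_day = stats.f_oneway(*groups)

# 3. Party size: group means plus a rank correlation
by_size = sample.groupby("size")["tip_pct"].agg(["mean", "count"])
rho, p_size = stats.spearmanr(sample["size"], sample["tip_pct"])

print(f"Lunch vs. dinner: t = {t_stat:.2f}, p = {p_time:.3f}")
print(f"Day effect:       F = {f_stat:.2f}, p = {p_day:.3f}")
print(f"Party size:       rho = {rho:.2f}, p = {p_size:.3f}")
print(by_size.round(2))
```

If the generated code picks different tests, ask the LLM to justify the choice and state the assumptions each test makes.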

🧠 Step 5: Statistical Concept Explanations

When you encounter unfamiliar statistical concepts:

%%llm
Please explain the following concepts in the context of our tips dataset analysis:
1. What is a t-test and when should I use it?
2. What does p-value mean in our analysis?
3. What are the assumptions of linear regression?
4. How should I interpret the correlation coefficient?

Explain with simple examples using our restaurant tipping context.

๐Ÿ› ๏ธ Step 6: Data Cleaning and Preparationยถ

Get help with data cleaning and transformation:

%%llm
Generate code to clean and prepare the tips dataset:
1. Create a new 'tip_percentage' column (tip as a percentage of total bill)
2. Convert categorical variables to appropriate formats for modeling
3. Check for and handle any potential outliers
4. Split the data into training and testing sets for a model that predicts tip percentage

Include explanations for each preprocessing step.
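As a reference point for checking the generated preprocessing code, here is a minimal sketch of the four steps. It assumes the tips column names, uses the 1.5 × IQR rule as one common (not the only) outlier criterion, and runs on a synthetic stand-in frame.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the tips dataset's column names
rng = np.random.default_rng(2)
n = 150
sample = pd.DataFrame(
    {
        "total_bill": rng.uniform(8, 50, n).round(2),
        "sex": rng.choice(["Female", "Male"], n),
        "smoker": rng.choice(["Yes", "No"], n),
        "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
        "time": rng.choice(["Lunch", "Dinner"], n),
        "size": rng.integers(1, 7, n),
    }
)
sample["tip"] = (sample["total_bill"] * rng.uniform(0.08, 0.25, n)).round(2)

# 1. Tip as a percentage of the total bill
prepared = sample.copy()
prepared["tip_percentage"] = 100 * prepared["tip"] / prepared["total_bill"]

# 3. Drop rows whose tip percentage falls outside 1.5 * IQR of the quartiles
q1, q3 = prepared["tip_percentage"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = prepared["tip_percentage"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
prepared = prepared[in_range]

# 2. One-hot encode the categorical columns; 'tip' is dropped because the
#    target is computed from it, which would leak the answer into the features
features = pd.get_dummies(
    prepared.drop(columns=["tip", "tip_percentage"]),
    columns=["sex", "smoker", "day", "time"],
    drop_first=True,
)
target = prepared["tip_percentage"]

# 4. Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Whether outliers should be removed, capped, or kept is a judgment call that depends on the data, so treat the filter here as a placeholder for that decision.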

🤖 Step 7: Model Building Assistance

Ask for help creating predictive models:

%%llm
Write code to build three different models to predict tip percentage:
1. A simple linear regression model
2. A random forest regressor
3. A gradient boosting model

Include feature selection, model training, evaluation metrics, and a comparison of the models' performance.
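A comparison along these lines could be sketched as follows. The feature names (`is_dinner`, `is_weekend`) and the synthetic target are assumptions made so the example is self-contained, and cross-validated R² is used here as one reasonable evaluation choice rather than the only one.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and target (hypothetical stand-in for prepared tips data)
rng = np.random.default_rng(3)
n = 200
X = pd.DataFrame(
    {
        "total_bill": rng.uniform(8, 50, n),
        "size": rng.integers(1, 7, n),
        "is_dinner": rng.integers(0, 2, n),
        "is_weekend": rng.integers(0, 2, n),
    }
)
y = 18 - 0.08 * X["total_bill"] + 0.5 * X["is_dinner"] + rng.normal(0, 2, n)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Mean R^2 over 5 folds for each model, so no single split decides the ranking
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
comparison = pd.Series(scores, name="mean_cv_r2").sort_values(ascending=False)
print(comparison.round(3))
```

On a dataset as small as the tips data, differences between models can be within the fold-to-fold noise, so it is worth looking at the per-fold scores before declaring a winner.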

📊 Step 8: Results Interpretation

After running analyses, ask the LLM to help interpret results:

# First, run your analysis code and capture the output
# Then paste the relevant results as context (the values below are illustrative placeholders)

%llm_config --snippet """
Model Results:
Linear Regression R² = 0.45
Random Forest R² = 0.67
Gradient Boosting R² = 0.72

Feature Importances:
- party_size: 0.35
- total_bill: 0.25
- day (Saturday): 0.15
- time (Dinner): 0.12
- smoker (Yes): 0.08
- day (Sunday): 0.05
"""

%%llm
Interpret these model results for predicting tip percentage:
1. Which model performed best and why?
2. What are the most important factors influencing tip amounts?
3. How would you explain these results to a restaurant manager?
4. What recommendations would you make based on this analysis?

๐Ÿ“ Step 9: Generating Analysis Reportsยถ

Use CellMage to help create professional reports:

%%llm
Based on our analysis of the restaurant tipping dataset, generate a structured report with these sections:
1. Executive Summary (2-3 paragraphs)
2. Methodology (brief description of data and approaches)
3. Key Findings (bullet points of 4-5 main discoveries)
4. Visualizations to Include (list the most important ones)
5. Business Recommendations (3-4 actionable insights)
6. Future Analysis Suggestions

Write this for a restaurant management audience.

🔄 Step 10: Refining Analysis Iteratively

Use the iterative nature of CellMage to refine your analysis:

%%llm
The scatter plot showing tip amount vs total bill revealed a potential non-linear relationship.
Suggest three different transformations we could apply, and generate code to:
1. Apply these transformations
2. Visualize the transformed relationships
3. Determine which transformation makes the relationship most linear
4. Rebuild our regression model with the best transformation
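To make the idea of "most linear" concrete, the sketch below compares three candidate transformations of the bill (log, square root, reciprocal) against the untransformed baseline, using the absolute Pearson correlation with the tip as a simple linearity score. The data are synthetic and built so that the tip grows with the logarithm of the bill; that relationship is an assumption of the example, not a property of the real dataset.

```python
import numpy as np

# Synthetic data: tip is linear in log(bill) by construction
rng = np.random.default_rng(4)
bill = rng.uniform(5, 60, 300)
tip = 2.0 * np.log(bill) + rng.normal(0, 0.1, bill.size)

# Candidate transformations of the predictor
transforms = {
    "identity": lambda x: x,
    "log": np.log,
    "sqrt": np.sqrt,
    "reciprocal": lambda x: 1.0 / x,
}

# Score each one by |Pearson r| between the transformed bill and the tip
linearity = {
    name: abs(np.corrcoef(fn(bill), tip)[0, 1]) for name, fn in transforms.items()
}
best_name = max(linearity, key=linearity.get)

# Refit a straight line on the best-scoring transformation
slope, intercept = np.polyfit(transforms[best_name](bill), tip, deg=1)

for name, score in sorted(linearity.items(), key=lambda item: -item[1]):
    print(f"{name:>10}: |r| = {score:.4f}")
print(f"best: {best_name}, tip = {slope:.2f} * {best_name}(bill) + {intercept:.2f}")
```

Correlation alone can hide curvature, so a residual plot for the chosen transformation is a useful second check.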

🧪 Advanced Data Analysis Applications

Time Series Analysis

# First load a time series dataset
ts_data = pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv')
ts_data['Month'] = pd.to_datetime(ts_data['Month'])
ts_data.set_index('Month', inplace=True)
ts_data.head()

%%llm
Generate comprehensive time series analysis code for this airline passenger dataset:
1. Decompose the series into trend, seasonal, and residual components
2. Test for stationarity and apply transformations if needed
3. Build and evaluate a forecasting model (ARIMA or Prophet)
4. Generate and visualize a 12-month forecast with confidence intervals
5. Explain key insights about seasonality and trends

Include well-commented code and explanations.
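Before asking for a full forecasting pipeline, it can help to see what the decomposition step does. The sketch below is a deliberately simplified additive decomposition written with pandas only, run on a synthetic monthly series; it is a stand-in for library routines, not a replacement for them. The airline series is usually treated as multiplicative, and taking logarithms first brings it into the additive form used here.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend + yearly cycle + small noise
rng = np.random.default_rng(5)
months = pd.date_range("2015-01-01", periods=72, freq="MS")
t = np.arange(len(months))
series = pd.Series(
    100 + 2 * t + 10 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, len(t)),
    index=months,
    name="passengers",
)

# Trend: 12-month moving average. With an even window the centered mean sits
# half a step off; a 2x12 moving average would correct that.
trend = series.rolling(window=12, center=True).mean()

# Seasonal component: average detrended value per calendar month, re-centered
detrended = series - trend
seasonal_means = detrended.groupby(detrended.index.month).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(
    seasonal_means.loc[series.index.month].to_numpy(),
    index=series.index,
    name="seasonal",
)

# Residual: what neither trend nor seasonality explains (NaN at the edges)
residual = series - trend - seasonal

print(seasonal_means.round(2))
print(residual.dropna().describe().round(2))
```

For the real dataset, pass `ts_data` (or its logarithm) through the same steps, or ask the LLM for a version built on a dedicated time series library and compare the components.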

Exploratory Data Analysis (EDA) Pipeline

%%llm
Create a reusable EDA function that I can apply to any dataset. The function should:
1. Provide basic statistics (missing values, duplicates, descriptive stats)
2. Generate appropriate visualizations based on data types
3. Identify potential outliers and unusual patterns
4. Examine correlations and relationships between variables
5. Suggest potential feature engineering steps
6. Output a summary of findings

Make it modular and well-commented so I can adapt it for future datasets.
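A compact starting point for such a function might look like the sketch below. It covers the non-visual parts of the request (basic statistics, outlier counts, correlations) and returns them as a dictionary. The name `quick_eda` and the tiny example frame are hypothetical.

```python
import pandas as pd


def quick_eda(frame: pd.DataFrame, outlier_factor: float = 1.5) -> dict:
    """Collect basic EDA facts about a dataframe into a dictionary."""
    numeric = frame.select_dtypes(include="number")
    categorical = frame.select_dtypes(exclude="number")

    # IQR rule: flag values beyond outlier_factor * IQR from the quartiles
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = numeric.lt(q1 - outlier_factor * iqr, axis="columns") | numeric.gt(
        q3 + outlier_factor * iqr, axis="columns"
    )

    return {
        "shape": frame.shape,
        "missing": frame.isna().sum().to_dict(),
        "duplicates": int(frame.duplicated().sum()),
        "numeric_summary": numeric.describe().T,
        "category_counts": {
            col: frame[col].value_counts().to_dict() for col in categorical.columns
        },
        "outlier_counts": outside.sum().to_dict(),
        "correlations": numeric.corr(),
    }


# Tiny example with one missing tip, one duplicated row and one extreme bill
example = pd.DataFrame(
    {
        "total_bill": [10.0, 12.0, 11.0, 13.0, 12.0, 200.0],
        "tip": [1.5, 2.0, None, 2.2, 2.0, 20.0],
        "day": ["Sat", "Sun", "Sat", "Sun", "Sun", "Sat"],
    }
)

report = quick_eda(example)
print(report["missing"], report["duplicates"], report["outlier_counts"])
```

Returning a dictionary keeps the function easy to extend: plotting or feature-engineering suggestions can be added as further keys without changing the callers.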

Interactive Dashboard Code Generator

%%llm
Generate code for a simple Plotly Dash application that creates an interactive dashboard for the tips dataset with:
1. A dropdown to select variables for the x and y axes
2. Filter options for categorical variables (day, time, smoker)
3. Radio buttons to switch between different chart types
4. A statistics panel showing key metrics about the selected data
5. A data table showing the filtered records

Include all necessary code to run the application locally.

โš ๏ธ Limitations and Best Practicesยถ

Working with data and LLMs has some important considerations:

  1. Data privacy - Be careful not to share sensitive data with LLMs

  2. Verification is essential - Always review and validate generated code and conclusions

  3. Context limitations - Large datasets may exceed token limits; share summaries instead

  4. Statistical accuracy - Double-check statistical claims and methodologies

  5. Code execution - Test generated code on small data subsets before running on large datasets
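The privacy and context-size points can be combined into one habit: share a short, filtered excerpt instead of the full table. The sketch below is one way to do that; the helper `shareable_excerpt` and the `customer_name` column are hypothetical. Dropping columns is not full anonymization, so sensitive values that remain in other columns still need your own review.

```python
import pandas as pd


def shareable_excerpt(frame, sensitive_columns, max_rows=5, seed=0):
    """Build a short text excerpt that omits the listed columns."""
    # Remove the sensitive columns that are actually present
    safe = frame.drop(columns=[c for c in sensitive_columns if c in frame.columns])
    # Keep only a small random sample of rows to stay within token limits
    rows = safe.sample(n=min(max_rows, len(safe)), random_state=seed)
    return "\n\n".join(
        [
            f"{len(frame)} rows, {safe.shape[1]} shared columns",
            safe.describe().to_string(),
            rows.to_string(index=False),
        ]
    )


# Hypothetical frame containing a column that should not leave the notebook
orders = pd.DataFrame(
    {
        "customer_name": ["Alice Example", "Bob Example", "Cara Example"],
        "total_bill": [24.5, 18.0, 31.2],
        "tip": [4.0, 2.5, 5.0],
    }
)

excerpt_text = shareable_excerpt(orders, sensitive_columns=["customer_name"])
print(excerpt_text)
```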

🚦 Data Analysis Best Practices

  • Start broad, then narrow - Begin with exploratory questions before diving into specifics

  • Provide clear context - Share data types, sizes, and descriptive statistics

  • Ask for explanations - Request comments in generated code to improve understanding

  • Iterate on results - Use outputs to inform your next set of questions

  • Combine expertise - Use LLM suggestions alongside your domain knowledge

🎓 What's Next?

You've now learned how to use CellMage for data analysis.

May your data reveal its secrets through magical assistance! ✨