Descriptive Statistics: Understanding Your Data's Story
Imagine you're a detective 🕵️ investigating a crime scene. Before drawing any conclusions, you need to gather evidence, examine patterns, and understand what happened. Descriptive statistics do exactly that for data! They are your first tool in the data science toolkit, helping you summarize, visualize, and understand the fundamental characteristics of your dataset before diving into complex analyses.
The Big Picture: What Are Descriptive Statistics? 🎯
Descriptive statistics are like a passport for your data: they tell you the essential information about your dataset at a glance. Instead of looking at thousands of individual data points, you get a concise summary that captures how your data behaves.
The Restaurant Review Analogy
Think of descriptive statistics like restaurant reviews. Instead of reading 1,000 individual reviews, you want to know:
- What's the average rating (mean)?
- What rating appears most often (mode)?
- What's the middle rating, which a few extreme reviews barely move (median)?
- How consistent are the ratings (standard deviation)?
- Do the ratings lean toward one end, with a long tail of terrible or amazing reviews (skewness)?
This summary gives you the restaurant's "statistical story" instantly!
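To make the analogy concrete, here is a minimal sketch that answers those questions for a handful of ratings. The ratings are made up for illustration, and the `summary` dictionary is just a convenient container, not part of any library.

```python
import pandas as pd

# Hypothetical star ratings for one restaurant (made up for illustration).
ratings = pd.Series([5, 4, 4, 5, 3, 4, 1, 5, 4, 4])

summary = {
    "mean": ratings.mean(),          # average rating
    "median": ratings.median(),      # middle rating, resistant to extremes
    "mode": ratings.mode().iloc[0],  # most frequent rating (first one if tied)
    "std": ratings.std(),            # sample standard deviation (ddof=1)
    "skew": ratings.skew(),          # negative here: a tail of low ratings
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

The single 1-star review pulls the mean below the median and gives the ratings a negative skew, which is the kind of detail a lone average would hide.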
Common Pitfalls and How to Avoid Them
Even experienced data scientists make these mistakes. Let's learn how to avoid them!
import numpy as np
import pandas as pd
print("⚠️ Common Pitfalls in Descriptive Statistics")
print("=" * 60)
# Pitfall 1: Mean vs Median Confusion
print("\n❌ Pitfall 1: Using Mean for Skewed Data")
print("-" * 40)
# Income data (highly skewed): many typical earners plus a small group of high earners
incomes = np.concatenate([
    np.random.normal(50000, 10000, 900),    # Most people
    np.random.normal(500000, 100000, 100)   # High earners
])
mean_income = np.mean(incomes)
median_income = np.median(incomes)
print(f"Mean income: ${mean_income:,.2f}")
print(f"Median income: ${median_income:,.2f}")
print(f"Difference: ${mean_income - median_income:,.2f}")
print("\n✅ Solution: Use median for skewed distributions!")
# Pitfall 2: Ignoring Missing Data
print("\n❌ Pitfall 2: Silent Missing Data")
print("-" * 40)
data_with_nulls = pd.Series([1, 2, 3, np.nan, 5, np.nan, 7, 8, 9, 10])
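# pandas skips NaN by default, so the mean below silently uses only the 8 valid values
print(f"Mean (NaN skipped by default): {data_with_nulls.mean():.2f}")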
print(f"Count without checking: {len(data_with_nulls)}")
print(f"Count of valid values: {data_with_nulls.count()}")
print(f"Percentage missing: {data_with_nulls.isnull().sum()/len(data_with_nulls)*100:.1f}%")
print("\n✅ Solution: Always check for and handle missing values!")
# Pitfall 3: Simpson's Paradox
print("\n❌ Pitfall 3: Simpson's Paradox")
print("-" * 40)
# Example: Treatment success rates
hospital_data = pd.DataFrame({
    'hospital': ['A'] * 1000 + ['B'] * 1000,
    'severity': ['mild'] * 800 + ['severe'] * 200 + ['mild'] * 200 + ['severe'] * 800,
    # Fixed outcome counts (1 = success, 0 = failure) so the reversal is guaranteed.
    # Hospital B must beat Hospital A within each severity level for the paradox to appear.
    'success': np.concatenate([
        np.repeat([1, 0], [720, 80]),    # Hospital A, mild cases: 90% success
        np.repeat([1, 0], [60, 140]),    # Hospital A, severe cases: 30% success
        np.repeat([1, 0], [190, 10]),    # Hospital B, mild cases: 95% success
        np.repeat([1, 0], [320, 480])    # Hospital B, severe cases: 40% success
    ])
})
# Overall success rates
overall = hospital_data.groupby('hospital')['success'].mean()
print("Overall success rates:")
print(overall)
print("\n⚠️ Hospital A looks better overall!")
# But when we control for severity...
by_severity = hospital_data.groupby(['hospital', 'severity'])['success'].mean().unstack()
print("\nSuccess rates by severity:")
print(by_severity)
print("\n✅ Hospital B is actually better for both mild AND severe cases!")
print(" This reversal is Simpson's Paradox!")
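Why can the ranking flip at all? Each hospital's overall rate is a weighted average of its per-severity rates, weighted by its share of mild and severe cases. The sketch below works through that arithmetic. The rates, the case mix, and the helper `overall_rate` are all hypothetical values and names chosen for illustration.

```python
# Hypothetical per-severity success rates and case mix (illustrative values only).
rates = {
    "A": {"mild": 0.90, "severe": 0.30},
    "B": {"mild": 0.95, "severe": 0.40},
}
case_mix = {
    "A": {"mild": 0.80, "severe": 0.20},  # A treats mostly mild cases
    "B": {"mild": 0.20, "severe": 0.80},  # B treats mostly severe cases
}

def overall_rate(hospital):
    """Overall rate = sum over severities of (share of cases) * (success rate)."""
    return sum(case_mix[hospital][s] * rates[hospital][s] for s in rates[hospital])

for hospital in rates:
    print(f"Hospital {hospital}: overall success rate = {overall_rate(hospital):.2f}")

# B wins within each severity level...
b_better_everywhere = all(rates["B"][s] > rates["A"][s] for s in rates["A"])
# ...yet A has the higher overall rate because its case mix is easier.
a_better_overall = overall_rate("A") > overall_rate("B")
print(f"B better in every group: {b_better_everywhere}")
print(f"A better overall: {a_better_overall}")
```

The paradox needs both ingredients: a group-level advantage for one hospital and a very different case mix between the two. With identical case mixes, the overall ranking would match the per-group ranking.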
Quick Reference Guide
Central Tendency
Mean: Average value
Median: Middle value
Mode: Most frequent
Use median for skewed data!
Spread
Range: Max - Min
IQR: Q3 - Q1
Std Dev: Typical distance from the mean (square root of the variance)
IQR is robust to outliers!
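A quick way to see that robustness is to add one extreme value to a small sample and compare the three spread measures before and after. This is a minimal sketch: the numbers are hypothetical and `spread` is an illustrative helper, not a library function.

```python
import numpy as np

# Hypothetical measurements, before and after one extreme value is added.
clean = np.array([10, 12, 13, 14, 15, 16, 18])
with_outlier = np.append(clean, 100)

def spread(values):
    """Return range, IQR, and sample standard deviation for a 1-D array."""
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "std": values.std(ddof=1),  # ddof=1 matches the pandas default
    }

for label, values in [("clean", clean), ("with outlier", with_outlier)]:
    result = spread(values)
    print(f"{label:>12}: range={result['range']:.1f}, "
          f"IQR={result['iqr']:.2f}, std={result['std']:.2f}")
```

With these values the IQR moves only from 3.0 to 3.75, while the range and standard deviation both grow more than tenfold.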
Shape
Skewness: Asymmetry
Kurtosis: Tail heaviness
Skew > 0 = Longer right tail
Excess kurtosis > 0 = Heavier tails than a normal distribution
Relationships
Correlation: Linear relationship
Covariance: Joint variability
-1 ≤ correlation ≤ 1
Correlation ≠ Causation!
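One practical difference between the two measures: covariance changes with the units of measurement, while correlation does not. The sketch below shows this on a tiny hypothetical dataset. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical paired measurements (made up for illustration).
df = pd.DataFrame({
    "height_m": [1.50, 1.60, 1.70, 1.80, 1.90],
    "weight_kg": [52.0, 58.0, 69.0, 75.0, 88.0],
})
df["height_cm"] = df["height_m"] * 100  # same information, different unit

cov_m = df["height_m"].cov(df["weight_kg"])
cov_cm = df["height_cm"].cov(df["weight_kg"])
corr_m = df["height_m"].corr(df["weight_kg"])
corr_cm = df["height_cm"].corr(df["weight_kg"])

print(f"Covariance (m):   {cov_m:.3f}")
print(f"Covariance (cm):  {cov_cm:.3f}")   # 100x larger: covariance depends on units
print(f"Correlation (m):  {corr_m:.3f}")
print(f"Correlation (cm): {corr_cm:.3f}")  # unchanged: correlation is unit-free
```

That is why correlation is usually the better choice for comparing relationships across variables measured on different scales.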
Summary: Your Statistical Toolkit
🎯 Key Takeaways:
- Always Start Here: Descriptive statistics are your first step in any analysis
- Know Your Measures: Different statistics tell different stories - use the right one
- Check Distribution: Skewness and outliers affect which statistics to use
- Visualize Everything: Graphs reveal patterns numbers can't show
- Group Analysis: Statistics by groups reveal hidden insights
- Be Robust: Use median and IQR when dealing with outliers
- Document Everything: Keep track of your statistical decisions and why
# Quick Reference Card for Descriptive Statistics
import pandas as pd
import numpy as np
from scipy import stats
# ESSENTIAL FUNCTIONS
# ------------------
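# For a DataFrame `df` (assumed to already exist; 'col', 'value', and
# 'category' below are placeholder column names)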
df.describe() # Basic statistics
df.info() # Data types and missing values
df.shape # Dimensions
df.dtypes # Column types
# Central Tendency
df['col'].mean() # Average
df['col'].median() # Middle value
df['col'].mode() # Most frequent value(s); returns a Series
# Spread
df['col'].std() # Standard deviation
df['col'].var() # Variance
df['col'].quantile([0.25, 0.75]) # Quartiles
# Distribution Shape
df['col'].skew() # Skewness
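df['col'].kurtosis() # Excess kurtosis (about 0 for a normal distribution)
# Relationships
df.corr(numeric_only=True) # Correlation matrix (numeric columns only)
df.cov(numeric_only=True) # Covariance matrix (numeric columns only)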
# Grouped Statistics
df.groupby('category')['value'].agg(['mean', 'std', 'count'])
# Missing Data
df.isnull().sum() # Count missing
df.dropna() # Remove missing
df.fillna(df.mean(numeric_only=True)) # Fill missing numeric values with column means
print("You're now equipped with descriptive statistics mastery!")