
📊 Descriptive Statistics: Understanding Your Data's Story

Imagine you're a detective 🕵️ investigating a crime scene. Before drawing any conclusions, you need to gather evidence, examine patterns, and understand what happened. That's exactly what descriptive statistics does for data! It's the first tool in your data science toolkit: it helps you summarize, visualize, and understand the fundamental characteristics of your dataset before diving into complex analyses.

The Big Picture: What Are Descriptive Statistics? 🎯

Descriptive statistics are like a passport for your data: they give you the essential information about your dataset at a glance. Instead of looking at thousands of individual data points, you get a concise summary that captures how your data behaves.

graph TB
    A[Raw Data] --> B[Descriptive Statistics]
    B --> C[Central Tendency]
    B --> D[Variability/Spread]
    B --> E[Distribution Shape]
    B --> F[Relationships]
    C --> G[Mean]
    C --> H[Median]
    C --> I[Mode]
    D --> J[Range]
    D --> K[Variance]
    D --> L[Standard Deviation]
    D --> M[IQR]
    E --> N[Skewness]
    E --> O[Kurtosis]
    F --> P[Correlation]
    F --> Q[Covariance]
    style B fill:#667eea
    style C fill:#4ecdc4
    style D fill:#fbbf24
    style E fill:#10b981
    style F fill:#f472b6

The Restaurant Review Analogy 🍕

Think of descriptive statistics like restaurant reviews. Instead of reading 1,000 individual reviews, you want to know: What's the average rating (mean)? What rating appears most often (mode)? What's the middle rating, which a few extreme scores barely move (median)? How consistent are the ratings (standard deviation)? Are the ratings lopsided toward terrible or amazing scores (skewness)? This summary gives you the restaurant's "statistical story" instantly!
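Here is a minimal sketch of that review summary in pandas. The ratings are made up for illustration, and the `ratings` and `summary` names are just placeholders.

```python
import pandas as pd

# Hypothetical star ratings (1-5) for one restaurant; made up for illustration
ratings = pd.Series([5, 4, 5, 3, 5, 4, 1, 5, 4, 5, 2, 5])

summary = {
    "mean": ratings.mean(),          # average rating
    "median": ratings.median(),      # middle rating, resistant to extremes
    "mode": ratings.mode().iloc[0],  # most common rating
    "std": ratings.std(),            # consistency (sample standard deviation)
    "skew": ratings.skew(),          # negative = a tail of low ratings
}

for name, value in summary.items():
    print(f"{name:>6}: {value:.2f}")
```

A couple of very low ratings pull the mean below the median and make the skewness negative, which is the "statistical story" of a mostly loved restaurant with a few unhappy guests.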

Common Pitfalls and How to Avoid Them 🚧

Even experienced data scientists make these mistakes. Let's learn how to avoid them!

import numpy as np
import pandas as pd

np.random.seed(42)  # Fix the seed so the simulated numbers are reproducible

print("⚠️ Common Pitfalls in Descriptive Statistics")
print("=" * 60)

# Pitfall 1: Mean vs Median Confusion
print("\n❌ Pitfall 1: Using Mean for Skewed Data")
print("-" * 40)

# Income data (highly skewed)
incomes = np.concatenate([
    np.random.normal(50000, 10000, 900),  # Most people
    np.random.normal(500000, 100000, 100)  # High earners
])

mean_income = np.mean(incomes)
median_income = np.median(incomes)

print(f"Mean income: ${mean_income:,.2f}")
print(f"Median income: ${median_income:,.2f}")
print(f"Difference: ${mean_income - median_income:,.2f}")
print("\n✅ Solution: Use median for skewed distributions!")

# Pitfall 2: Ignoring Missing Data
print("\n❌ Pitfall 2: Silent Missing Data")
print("-" * 40)

data_with_nulls = pd.Series([1, 2, 3, np.nan, 5, np.nan, 7, 8, 9, 10])

# pandas skips NaN by default, so this mean is computed from 8 values, not 10
print(f"Mean (pandas, NaN skipped silently): {data_with_nulls.mean():.2f}")
# NumPy propagates NaN instead of skipping it
print(f"Mean (NumPy, NaN propagated): {np.mean(data_with_nulls.to_numpy())}")
print(f"Count without checking: {len(data_with_nulls)}")
print(f"Count of valid values: {data_with_nulls.count()}")
print(f"Percentage missing: {data_with_nulls.isnull().mean() * 100:.1f}%")
print("\n✅ Solution: Always check for and handle missing values!")

# Pitfall 3: Simpson's Paradox
print("\n❌ Pitfall 3: Simpson's Paradox")
print("-" * 40)

# Example: treatment success rates.
# Hospital B has the higher success probability within each severity level,
# but it treats mostly severe cases, while Hospital A treats mostly mild ones.
hospital_data = pd.DataFrame({
    'hospital': ['A'] * 1000 + ['B'] * 1000,
    'severity': ['mild'] * 800 + ['severe'] * 200 + ['mild'] * 200 + ['severe'] * 800,
    'success': np.concatenate([
        np.random.binomial(1, 0.80, 800),  # Hospital A, mild cases
        np.random.binomial(1, 0.20, 200),  # Hospital A, severe cases
        np.random.binomial(1, 0.90, 200),  # Hospital B, mild cases
        np.random.binomial(1, 0.30, 800)   # Hospital B, severe cases
    ])
})

# Overall success rates
overall = hospital_data.groupby('hospital')['success'].mean()
print("Overall success rates:")
print(overall)
print("\n⚠️ Hospital A looks better overall!")

# But when we control for severity...
by_severity = hospital_data.groupby(['hospital', 'severity'])['success'].mean().unstack()
print("\nSuccess rates by severity:")
print(by_severity)
print("\n✅ Hospital B is actually better for both mild AND severe cases!")
print("   This reversal is Simpson's Paradox!")
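
The reversal in Simpson's Paradox comes from case mix: a hospital's overall success rate is its per-severity rates weighted by how many patients fall into each severity level. The sketch below uses fixed, hypothetical counts instead of random draws so the arithmetic is easy to follow. The counts and the column names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical fixed counts (not sampled), chosen so that hospital B has the
# higher success rate within each severity level
counts = pd.DataFrame({
    "hospital":  ["A", "A", "B", "B"],
    "severity":  ["mild", "severe", "mild", "severe"],
    "patients":  [800, 200, 200, 800],
    "successes": [640, 40, 180, 240],
})

# Per-group rate: successes / patients
counts["rate"] = counts["successes"] / counts["patients"]
by_severity = counts.pivot(index="hospital", columns="severity", values="rate")

# Overall rate per hospital: total successes / total patients, which equals
# the per-severity rates weighted by each hospital's case mix
totals = counts.groupby("hospital")[["successes", "patients"]].sum()
overall = totals["successes"] / totals["patients"]

print(by_severity)
print(overall)
```

Hospital A treats mostly mild cases, so its overall rate sits close to its mild-case rate. Hospital B treats mostly severe cases, so its overall rate is dragged toward its severe-case rate even though it does better in both groups.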

Quick Reference Guide 📚

Central Tendency

Mean: Average value

Median: Middle value

Mode: Most frequent

Use median for skewed data!
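
To see why the median is the safer choice for skewed data, here is a minimal sketch with made-up salary figures (in thousands). Adding a single extreme value moves the mean a lot and the median hardly at all.

```python
import numpy as np

# Hypothetical salaries in thousands; values are made up for illustration
salaries = np.array([42, 45, 48, 50, 52, 55, 58])
with_outlier = np.append(salaries, 1000)  # add one extreme earner

print(f"Mean:   {salaries.mean():.1f} -> {with_outlier.mean():.1f}")
print(f"Median: {np.median(salaries):.1f} -> {np.median(with_outlier):.1f}")
```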

Spread

Range: Max - Min

IQR: Q3 - Q1

Std Dev: Typical distance from the mean (square root of the variance)

IQR is robust to outliers!
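
The following sketch compares the three spread measures on a small made-up sample, with and without one extreme value. The `spread_summary` helper is a hypothetical name introduced here, not part of any library.

```python
import numpy as np

def spread_summary(values):
    """Return range, IQR, and sample standard deviation for a 1-D array."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    return {
        "range": values.max() - values.min(),
        "iqr": q3 - q1,
        "std": values.std(ddof=1),  # ddof=1 gives the sample standard deviation
    }

# Hypothetical measurements; the second set adds a single extreme value
clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = clean + [100]

print(spread_summary(clean))
print(spread_summary(with_outlier))
```

The range and the standard deviation jump by roughly an order of magnitude once the outlier is added, while the IQR barely changes, because it only looks at the middle half of the data.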

Shape

Skewness: Asymmetry

Kurtosis: Tail heaviness

Skew > 0 = Longer right tail

Excess kurtosis > 0 = Heavier tails than a normal distribution
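
As a rough illustration, the sketch below measures skewness and excess kurtosis for three simulated samples using `scipy.stats`. The samples, the seed, and the sample size are arbitrary choices made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Simulated samples (illustrative only)
samples = {
    "normal": rng.normal(size=10_000),            # symmetric, baseline tails
    "exponential": rng.exponential(size=10_000),  # long right tail
    "uniform": rng.uniform(size=10_000),          # bounded, lighter tails
}

for name, x in samples.items():
    skewness = stats.skew(x)
    excess_kurt = stats.kurtosis(x)  # Fisher definition: normal distribution = 0
    print(f"{name:>12}: skew={skewness:+.2f}, excess kurtosis={excess_kurt:+.2f}")
```

Expect values near zero for the normal sample, clearly positive skewness and kurtosis for the exponential sample, and negative excess kurtosis for the uniform sample. Note that `stats.kurtosis` and pandas' `kurtosis()` both report excess kurtosis by default.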

🔗 Relationships

Correlation: Linear relationship

Covariance: Joint variability

-1 ≤ correlation ≤ 1

Correlation ≠ Causation!
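
Two points from this card are easy to check numerically: covariance depends on the units of measurement while correlation does not, and correlation only captures linear relationships. The sketch below uses made-up height and weight values as a hypothetical example.

```python
import numpy as np

# Hypothetical paired measurements, made up for illustration
height_m = np.array([1.50, 1.60, 1.70, 1.80, 1.90])
weight_kg = np.array([52.0, 58.0, 67.0, 74.0, 85.0])
height_cm = height_m * 100  # same data, different unit

# Covariance changes with the unit of measurement...
cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

# ...while correlation is unit-free and stays the same
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"Covariance:  {cov_m:.3f} (m) vs {cov_cm:.1f} (cm)")
print(f"Correlation: {corr_m:.4f} (m) vs {corr_cm:.4f} (cm)")

# A perfect but nonlinear relationship can still have zero correlation
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
print(f"Correlation of x and x^2: {np.corrcoef(x, y)[0, 1]:.4f}")
```

Here `y` is completely determined by `x`, yet the correlation is zero because the relationship is symmetric rather than linear. A low correlation therefore does not rule out a strong dependence.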

Summary: Your Statistical Toolkit ✅

🎯 Key Takeaways:

# Quick Reference Card for Descriptive Statistics
import pandas as pd
import numpy as np
from scipy import stats

# ESSENTIAL FUNCTIONS
# ------------------

# For an existing DataFrame named df (assumed to be loaded already)
df.describe()                # Basic statistics
df.info()                    # Data types and missing values
df.shape                     # Dimensions
df.dtypes                    # Column types

# Central Tendency
df['col'].mean()            # Average
df['col'].median()          # Middle value
df['col'].mode()            # Most frequent value(s), returned as a Series

# Spread
df['col'].std()             # Standard deviation
df['col'].var()             # Variance
df['col'].quantile([0.25, 0.75])  # Quartiles

# Distribution Shape
df['col'].skew()            # Skewness
df['col'].kurtosis()        # Excess kurtosis (normal distribution = 0)

# Relationships
df.corr(numeric_only=True)  # Correlation matrix (numeric columns only)
df.cov(numeric_only=True)   # Covariance matrix (numeric columns only)

# Grouped Statistics
df.groupby('category')['value'].agg(['mean', 'std', 'count'])

# Missing Data
df.isnull().sum()           # Count missing
df.dropna()                 # Remove missing
df.fillna(df.mean(numeric_only=True))  # Fill numeric columns with their column means

print("📊 You're now equipped with descriptive statistics mastery!")