🔬 Hypothesis Testing: Making Data-Driven Decisions

Imagine you're a judge in a courtroom ⚖️. The defendant is "innocent until proven guilty" - that's your null hypothesis. You need strong evidence to convict (reject the null). This is hypothesis testing! It's the scientific method for data, helping us make decisions when we can't be 100% certain. From medicine to marketing, hypothesis testing separates gut feelings from statistical facts.

The Foundation: Understanding Hypotheses 🎯

The Cookie Factory Analogy 🍪
A cookie factory claims their cookies weigh 50g on average. You're the quality inspector. Your null hypothesis (H₀) is "the cookies DO weigh 50g." Your alternative hypothesis (H₁) is "the cookies DON'T weigh 50g." You take a sample of cookies and weigh them. If they're way off 50g, you reject the factory's claim. But how far is "way off"? That's what hypothesis testing tells us!
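To make "how far is way off?" concrete, here is a minimal sketch that measures the distance between the sample mean and the claimed 50g in units of standard error. The cookie weights are made-up illustrative values, not real measurements, and the variable names are our own.

```python
import numpy as np
from scipy import stats

# Hypothetical cookie weights in grams (illustrative values only)
weights = np.array([51.2, 49.8, 52.1, 50.5, 51.7, 50.9, 52.4, 49.6, 51.3, 50.8])
claimed_mean = 50.0

n = len(weights)
sample_mean = weights.mean()
standard_error = weights.std(ddof=1) / np.sqrt(n)

# Distance from the claim, measured in standard errors
t_manual = (sample_mean - claimed_mean) / standard_error
# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's built-in one-sample t-test should give the same numbers
t_scipy, p_scipy = stats.ttest_1samp(weights, claimed_mean)

print(f"Sample mean: {sample_mean:.2f} g")
print(f"Manual: t = {t_manual:.3f}, p = {p_manual:.4f}")
print(f"SciPy:  t = {t_scipy:.3f}, p = {p_scipy:.4f}")
```

The larger the t-statistic in absolute value, the further the sample sits from the factory's claim relative to the noise in the data.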

graph TB
    A[Research Question] --> B[Formulate Hypotheses]
    B --> C[Null Hypothesis H₀]
    B --> D[Alternative Hypothesis H₁]
    C --> E[Collect Data]
    D --> E
    E --> F[Calculate Test Statistic]
    F --> G[Find P-value]
    G --> H{P-value < α?}
    H -->|Yes| I[Reject H₀]
    H -->|No| J[Fail to Reject H₀]
    I --> K[Evidence supports H₁]
    J --> L[Insufficient evidence]
    style A fill:#667eea
    style I fill:#10b981
    style J fill:#fbbf24

H₀: Null Hypothesis

The "status quo" or "no effect" assumption. Examples:

The new drug has no effect
There is no difference between groups
The correlation is zero
The mean equals a specific value

H₁: Alternative Hypothesis

The claim you are looking for evidence to support. It can be:

Two-tailed: μ ≠ μ₀ (different from)
Left-tailed: μ < μ₀ (less than)
Right-tailed: μ > μ₀ (greater than)
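The three forms of H₁ map onto the `alternative` argument of SciPy's t-test functions (available in SciPy 1.6 and later). The sketch below runs the same hypothetical sample through all three; the data values and `mu0` are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and null value (illustrative only)
sample = np.array([52.1, 50.3, 51.8, 49.9, 53.0, 51.2, 50.7, 52.4])
mu0 = 50.0

results = {}
for alt in ("two-sided", "less", "greater"):
    # alternative selects which tail(s) of the t distribution count as "extreme"
    res = stats.ttest_1samp(sample, mu0, alternative=alt)
    results[alt] = res.pvalue
    print(f"H1 {alt:>9}: t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

Because the sample mean here is above `mu0`, the right-tailed p-value is half the two-tailed one, and the left-tailed p-value is close to 1. Pick the direction before looking at the data, not after.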

Types of Errors: The Risk Trade-off ⚠️

In hypothesis testing, we can make two types of mistakes. Think of it like a smoke detector: it can fail by alarming when there's no fire (Type I) or by staying silent when there IS a fire (Type II).

Type I Error (α)

False Positive

🚨 Rejecting H₀ when it's actually TRUE

Example: Convicting an innocent person

Probability: α (significance level)

Typically set at 0.05 (5%)

Type II Error (β)

False Negative

😴 Failing to reject H₀ when it's actually FALSE

Example: Acquitting a guilty person

Probability: β

Related to Power: 1 - β
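One way to see both error types in action is to simulate many experiments where we know the truth. This is a rough sketch under assumed settings (normal data with standard deviation 1, n = 30, a true effect of 0.5 when H₀ is false); the helper `rejection_rate` is a name introduced here for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, n_sims = 0.05, 30, 5000

def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0: mu = 0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_sims, n))
    # One t-test per row (one row = one simulated experiment)
    p_values = stats.ttest_1samp(samples, 0.0, axis=1).pvalue
    return np.mean(p_values < alpha)

type_i_rate = rejection_rate(0.0)  # H0 is true: every rejection is a false positive
power = rejection_rate(0.5)        # H0 is false: every rejection is a correct detection
type_ii_rate = 1 - power           # misses when there really is an effect

print(f"Type I error rate (should be near alpha): {type_i_rate:.3f}")
print(f"Power (1 - beta): {power:.3f}")
print(f"Type II error rate (beta): {type_ii_rate:.3f}")
```

Lowering α makes false alarms rarer but, for a fixed sample size, raises β. That is the trade-off in the smoke detector analogy.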

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for visualizations
sns.set_style("whitegrid")
np.random.seed(42)

class HypothesisTestingFramework:
    """
    Your complete guide to hypothesis testing!
    Master the art of statistical inference 🔬
    """
    
    def __init__(self):
        self.alpha = 0.05  # Standard significance level
        
    def understand_p_values(self):
        """
        Demystifying p-values: What they really mean
        """
        print("🎯 UNDERSTANDING P-VALUES")
        print("=" * 60)
        
        print("📊 What is a p-value?")
        print("The probability of observing data at least as extreme as what we got,")
        print("assuming the null hypothesis is TRUE.")
        print("\n⚠️ Common Misconceptions:")
        print("❌ P-value is NOT the probability that H₀ is true")
        print("❌ P-value is NOT the probability of making an error")
        print("❌ 1-p is NOT the probability that H₁ is true")
        
        print("\n📏 P-value Interpretation Scale:")
        print("-" * 40)
        
        p_values = [0.001, 0.01, 0.05, 0.10, 0.50]
        interpretations = [
            "Very strong evidence against H₀",
            "Strong evidence against H₀",
            "Moderate evidence against H₀",
            "Weak evidence against H₀",
            "No evidence against H₀"
        ]
        
        for p, interp in zip(p_values, interpretations):
            symbol = "⭐⭐⭐" if p <= 0.001 else "⭐⭐" if p <= 0.01 else "⭐" if p <= 0.05 else "○"
            print(f"p = {p:.3f}: {interp} {symbol}")
        
        # Simulation example
        print("\n🔬 Simulation: Coin Fairness Test")
        print("-" * 40)
        
        # Test if a coin is fair (H₀: p = 0.5)
        n_flips = 100
        observed_heads = 60  # We observed 60 heads
        
        # Exact test using binomial
        # binomtest replaces binom_test, which was removed in SciPy 1.12
        p_value = stats.binomtest(observed_heads, n_flips, p=0.5, alternative='two-sided').pvalue
        
        print(f"Flips: {n_flips}, Heads: {observed_heads}")
        print(f"H₀: Coin is fair (p = 0.5)")
        print(f"H₁: Coin is not fair (p ≠ 0.5)")
        print(f"P-value: {p_value:.4f}")
        
        if p_value < self.alpha:
            print(f"✅ Reject H₀: Evidence suggests coin is unfair")
        else:
            print(f"❌ Fail to reject H₀: Insufficient evidence of unfairness")
    
    def one_sample_t_test(self):
        """
        One-sample t-test: Testing a population mean
        """
        print("\n📈 ONE-SAMPLE T-TEST")
        print("=" * 60)
        
        # Scenario: Coffee shop claims average wait time is 3 minutes
        claimed_mean = 3.0
        
        # Sample data (actual wait times in minutes)
        wait_times = np.array([2.5, 3.2, 4.1, 2.8, 3.5, 3.9, 2.7, 3.3, 4.2, 3.6,
                               2.9, 3.4, 3.8, 2.6, 3.7, 4.0, 3.1, 2.8, 3.5, 3.2])
        
        print("☕ Coffee Shop Wait Time Analysis")
        print(f"Claimed average: {claimed_mean} minutes")
        print(f"Sample size: {len(wait_times)}")
        print(f"Sample mean: {wait_times.mean():.2f} minutes")
        print(f"Sample std: {wait_times.std(ddof=1):.2f} minutes")
        
        # Perform t-test
        t_stat, p_value = stats.ttest_1samp(wait_times, claimed_mean)
        
        print(f"\nTest Results:")
        print(f"t-statistic: {t_stat:.4f}")
        print(f"p-value: {p_value:.4f}")
        
        # Calculate confidence interval
        confidence = 0.95
        ci = stats.t.interval(confidence, len(wait_times)-1, 
                             loc=wait_times.mean(), 
                             scale=wait_times.std(ddof=1)/np.sqrt(len(wait_times)))
        
        print(f"\n{confidence*100:.0f}% Confidence Interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print(f"✅ Reject H₀: Wait time is significantly different from {claimed_mean} min")
            print(f"   The coffee shop's claim appears to be false!")
        else:
            print(f"❌ Fail to reject H₀: No significant difference from {claimed_mean} min")
            print("   The data are consistent with the claim (this does not prove it).")
        
        # Effect size (Cohen's d)
        cohens_d = (wait_times.mean() - claimed_mean) / wait_times.std(ddof=1)
        print(f"\n📊 Effect Size (Cohen's d): {cohens_d:.3f}")
        if abs(cohens_d) < 0.2:
            print("   Negligible effect")
        elif abs(cohens_d) < 0.5:
            print("   Small effect")
        elif abs(cohens_d) < 0.8:
            print("   Medium effect")
        else:
            print("   Large effect")
    
    def two_sample_t_test(self):
        """
        Two-sample t-test: Comparing two groups
        """
        print("\n📊 TWO-SAMPLE T-TEST")
        print("=" * 60)
        
        # Scenario: Comparing effectiveness of two teaching methods
        np.random.seed(42)
        
        # Test scores for two groups
        method_a = np.random.normal(75, 10, 30)  # Traditional method
        method_b = np.random.normal(80, 12, 35)  # New method
        
        print("🎓 Teaching Methods Comparison")
        print(f"Method A (Traditional): n={len(method_a)}, mean={method_a.mean():.1f}, std={method_a.std(ddof=1):.1f}")
        print(f"Method B (New):         n={len(method_b)}, mean={method_b.mean():.1f}, std={method_b.std(ddof=1):.1f}")
        
        # Test for equal variances (Levene's test)
        levene_stat, levene_p = stats.levene(method_a, method_b)
        print(f"\n📐 Levene's Test for Equal Variances:")
        print(f"   p-value: {levene_p:.4f}")
        
        equal_var = levene_p > 0.05
        print(f"   Variances are {'equal' if equal_var else 'unequal'} (use equal_var={equal_var})")
        
        # Perform t-test
        t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=equal_var)
        
        print(f"\n📊 Two-Sample T-Test Results:")
        print(f"   t-statistic: {t_stat:.4f}")
        print(f"   p-value: {p_value:.4f}")
        
        # Effect size (Cohen's d)
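        # Use sample variances (ddof=1) so the pooled estimate matches its n1 + n2 - 2 denominator
        pooled_std = np.sqrt(((len(method_a)-1)*method_a.var(ddof=1) + (len(method_b)-1)*method_b.var(ddof=1)) / 
                             (len(method_a) + len(method_b) - 2))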
        cohens_d = (method_b.mean() - method_a.mean()) / pooled_std
        
        print(f"   Cohen's d: {cohens_d:.3f}")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant difference between methods")
            if method_b.mean() > method_a.mean():
                print("   The new method appears to be better!")
            else:
                print("   The traditional method appears to be better!")
        else:
            print("❌ Fail to reject H₀: No significant difference between methods")
    
    def paired_t_test(self):
        """
        Paired t-test: Before and after comparisons
        """
        print("\n🔄 PAIRED T-TEST")
        print("=" * 60)
        
        # Scenario: Weight loss program (before and after)
        np.random.seed(42)
        
        n_participants = 20
        before_weight = np.random.normal(180, 20, n_participants)
        # After weight (most lose weight, but with variation)
        weight_loss = np.random.normal(8, 5, n_participants)
        after_weight = before_weight - weight_loss
        
        print("🏃 Weight Loss Program Analysis")
        print(f"Participants: {n_participants}")
        print(f"Before: mean={before_weight.mean():.1f} lbs, std={before_weight.std(ddof=1):.1f}")
        print(f"After:  mean={after_weight.mean():.1f} lbs, std={after_weight.std(ddof=1):.1f}")
        print(f"Average loss: {(before_weight - after_weight).mean():.1f} lbs")
        
        # Perform paired t-test
        t_stat, p_value = stats.ttest_rel(before_weight, after_weight)
        
        print(f"\n📊 Paired T-Test Results:")
        print(f"   t-statistic: {t_stat:.4f}")
        print(f"   p-value: {p_value:.4f}")
        
        # Calculate confidence interval for the difference
        differences = before_weight - after_weight
        ci = stats.t.interval(0.95, len(differences)-1,
                             loc=differences.mean(),
                             scale=differences.std(ddof=1)/np.sqrt(len(differences)))
        
        print(f"\n95% CI for weight loss: [{ci[0]:.1f}, {ci[1]:.1f}] lbs")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant weight loss detected")
            print(f"   The program is effective! Average loss: {differences.mean():.1f} lbs")
        else:
            print("❌ Fail to reject H₀: No significant weight loss")
    
    def chi_square_test(self):
        """
        Chi-square test: Testing categorical relationships
        """
        print("\n🎲 CHI-SQUARE TEST")
        print("=" * 60)
        
        # Scenario: Is customer satisfaction related to age group?
        # Creating contingency table
        data = pd.DataFrame({
            'Age_Group': ['18-30'] * 50 + ['31-50'] * 60 + ['51+'] * 40,
            'Satisfaction': np.random.choice(['Low', 'Medium', 'High'], 150, 
                                           p=[0.2, 0.5, 0.3])
        })
        
        # Add some relationship (older = more satisfied)
        data.loc[(data['Age_Group'] == '51+') & (data['Satisfaction'] == 'Low'), 'Satisfaction'] = 'High'
        
        # Create contingency table
        cont_table = pd.crosstab(data['Age_Group'], data['Satisfaction'])
        
        print("📊 Customer Satisfaction by Age Group")
        print(cont_table)
        
        # Perform chi-square test
        chi2, p_value, dof, expected = stats.chi2_contingency(cont_table)
        
        print(f"\n📈 Chi-Square Test Results:")
        print(f"   χ² statistic: {chi2:.4f}")
        print(f"   p-value: {p_value:.4f}")
        print(f"   Degrees of freedom: {dof}")
        
        # Calculate Cramér's V (effect size)
        n = cont_table.sum().sum()
        min_dim = min(cont_table.shape[0] - 1, cont_table.shape[1] - 1)
        cramers_v = np.sqrt(chi2 / (n * min_dim))
        
        print(f"   Cramér's V: {cramers_v:.3f}")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Satisfaction and age are related")
            print("   There's a significant association between age and satisfaction!")
        else:
            print("❌ Fail to reject H₀: No significant relationship")

# Run demonstrations
testing = HypothesisTestingFramework()
testing.understand_p_values()
testing.one_sample_t_test()
testing.two_sample_t_test()
testing.paired_t_test()
testing.chi_square_test()

The P-Value Scale: Making Decisions 📊

p < 0.01: Very Strong
p < 0.05: Strong
p < 0.10: Moderate
p ≥ 0.10: Weak/None

Example: p = 0.02 falls in the "Strong" band.

Choosing the Right Test 🎯

Test Type	Use Case	Data Type	Assumptions	Example
One-Sample t-test	Compare sample mean to known value	Continuous	Normal distribution	Is average height 170cm?
Two-Sample t-test	Compare means of two groups	Continuous	Normal, equal variances (use Welch's t-test if unequal)	Drug A vs Drug B
Paired t-test	Before/after comparisons	Continuous	Normal differences	Weight before/after diet
Chi-Square	Test independence	Categorical	Expected freq ≥ 5	Gender vs Product preference
ANOVA	Compare 3+ groups	Continuous	Normal, equal variances	Compare 3 teaching methods
Mann-Whitney U	Non-parametric alternative	Ordinal/Continuous	No normality required	Compare ratings
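The last two rows of the table can be sketched with SciPy's `f_oneway` and `mannwhitneyu`. The scores and ratings below are simulated or invented purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# ANOVA: hypothetical test scores under three teaching methods
method_a = rng.normal(70, 10, 40)
method_b = rng.normal(75, 10, 40)
method_c = rng.normal(85, 10, 40)
f_stat, p_anova = stats.f_oneway(method_a, method_b, method_c)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")

# Mann-Whitney U: hypothetical 1-5 star ratings for two products
ratings_x = np.array([3, 4, 2, 5, 4, 3, 4, 5, 3, 4])
ratings_y = np.array([2, 3, 1, 2, 3, 2, 4, 2, 3, 1])
u_stat, p_mw = stats.mannwhitneyu(ratings_x, ratings_y, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_mw:.4f}")
```

A significant ANOVA only says that at least one group mean differs; finding out which one needs a follow-up (post hoc) comparison.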

import numpy as np
from scipy import stats
import pandas as pd

class TestSelection:
    """
    Smart test selection based on your data
    """
    
    def select_test(self, data_type, n_groups, paired=False, normal=True):
        """
        Decision tree for test selection
        """
        data_type = data_type.lower()  # accept "Continuous" as well as "continuous"
        if data_type == 'categorical':
            if n_groups == 2:
                return "Chi-square test or Fisher's exact test"
            else:
                return "Chi-square test"
        
        elif data_type == 'continuous':
            if n_groups == 1:
                if normal:
                    return "One-sample t-test"
                else:
                    return "Wilcoxon signed-rank test"
            
            elif n_groups == 2:
                if paired:
                    if normal:
                        return "Paired t-test"
                    else:
                        return "Wilcoxon signed-rank test"
                else:
                    if normal:
                        return "Two-sample t-test"
                    else:
                        return "Mann-Whitney U test"
            
            else:  # n_groups > 2
                if normal:
                    return "One-way ANOVA"
                else:
                    return "Kruskal-Wallis test"
        
        return "Consult a statistician!"
    
    def normality_tests(self, data):
        """
        Test if data follows normal distribution
        """
        print("🔍 NORMALITY TESTS")
        print("=" * 40)
        
        # Shapiro-Wilk test (best for small samples)
        if len(data) <= 5000:
            stat, p_value = stats.shapiro(data)
            print(f"Shapiro-Wilk Test:")
            print(f"  Statistic: {stat:.4f}")
            print(f"  P-value: {p_value:.4f}")
            
            if p_value > 0.05:
                print("  ✅ Data appears to be normal")
            else:
                print("  ❌ Data is not normally distributed")
        
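        # Kolmogorov-Smirnov test
        # Note: the mean and std are estimated from the same data, which makes
        # this p-value conservative (the Lilliefors correction addresses this)
        stat, p_value = stats.kstest(data, 'norm', 
                                     args=(data.mean(), data.std(ddof=1)))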
        print(f"\nKolmogorov-Smirnov Test:")
        print(f"  Statistic: {stat:.4f}")
        print(f"  P-value: {p_value:.4f}")
        
        # Anderson-Darling test
        result = stats.anderson(data)
        print(f"\nAnderson-Darling Test:")
        print(f"  Statistic: {result.statistic:.4f}")
        print(f"  Critical values: {result.critical_values}")
        
        # The returned flag is based on the Kolmogorov-Smirnov p-value
        return p_value > 0.05

# Example usage
selector = TestSelection()

# Test selection examples
print("📋 TEST SELECTION GUIDE")
print("=" * 40)

scenarios = [
    ("Continuous", 1, False, True),
    ("Continuous", 2, False, True),
    ("Continuous", 2, True, True),
    ("Continuous", 3, False, True),
    ("Continuous", 2, False, False),
    ("Categorical", 2, False, None)
]

for data_type, n_groups, paired, normal in scenarios:
    test = selector.select_test(data_type, n_groups, paired, normal)
    print(f"\nData: {data_type}, Groups: {n_groups}, Paired: {paired}, Normal: {normal}")
    print(f"→ Use: {test}")

# Check normality of sample data
print("\n" + "=" * 40)
sample_data = np.random.normal(100, 15, 100)
selector.normality_tests(sample_data)

Power Analysis: Planning Your Study 💪

Statistical Power = 1 - β

The probability of correctly rejecting a false null hypothesis.
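Power can also be computed directly from the noncentral t distribution. The sketch below assumes a two-sided, equal-variance two-sample t-test with equal group sizes; `two_sample_power` is a helper name introduced here, not a library function.

```python
import numpy as np
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, equal-variance two-sample t-test (equal group sizes)."""
    df = 2 * n_per_group - 2
    # Noncentrality parameter: Cohen's d scaled by the information in the sample
    ncp = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value when the effect is real
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

for n in (10, 20, 30, 50, 64, 100):
    print(f"n per group = {n:3d}: power = {two_sample_power(0.5, n):.3f}")
```

For a medium effect (d = 0.5) the commonly used 80% power target is reached at roughly 64 participants per group.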

graph LR
    A[Power Analysis] --> B[Sample Size]
    A --> C[Effect Size]
    A --> D[Significance Level α]
    A --> E[Power 1-β]
    B --> F[How many subjects?]
    C --> G[How big is the difference?]
    D --> H[Type I error rate]
    E --> I[Type II error rate]
    style A fill:#0284c7
    style E fill:#10b981

import numpy as np
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt

class PowerAnalysis:
    """
    Plan your study with proper power analysis
    """
    
    def sample_size_calculation(self):
        """
        Calculate required sample size
        """
        print("📊 SAMPLE SIZE CALCULATION")
        print("=" * 40)
        
        # Parameters
        effect_size = 0.5  # Medium effect
        alpha = 0.05
        power = 0.8
        
        # TTestIndPower models two independent groups (nobs1, ratio);
        # TTestPower covers one-sample/paired designs and does not accept them
        analysis = TTestIndPower()
        
        # For two-sample t-test
        n = analysis.solve_power(effect_size=effect_size,
                               alpha=alpha,
                               power=power,
                               ratio=1,  # Equal group sizes
                               alternative='two-sided')
        
        print(f"Parameters:")
        print(f"  Effect size (Cohen's d): {effect_size}")
        print(f"  Significance level (α): {alpha}")
        print(f"  Desired power (1-β): {power}")
        
        print(f"\n✅ Required sample size per group: {np.ceil(n):.0f}")
        print(f"   Total sample size: {np.ceil(n)*2:.0f}")
        
        # Show how power changes with sample size
        print("\n📈 Power vs Sample Size:")
        sample_sizes = [10, 20, 30, 50, 100, 200]
        
        # Separate loop names so the targets computed above are not overwritten
        for n_obs in sample_sizes:
            achieved_power = analysis.solve_power(effect_size=effect_size,
                                                  alpha=alpha,
                                                  nobs1=n_obs,
                                                  ratio=1,
                                                  alternative='two-sided')
            print(f"  n={n_obs:3d}: Power = {achieved_power:.3f}")
        
        # Effect size guidelines
        print("\n📏 Cohen's d Effect Size Guidelines:")
        print("  Small:  d = 0.2")
        print("  Medium: d = 0.5")
        print("  Large:  d = 0.8")
    
    def minimum_detectable_effect(self):
        """
        What effect size can we detect with our sample?
        """
        print("\n🔍 MINIMUM DETECTABLE EFFECT")
        print("=" * 40)
        
        # Given constraints
        n_available = 50
        alpha = 0.05
        power = 0.8
        
        analysis = TTestIndPower()
        effect_size = analysis.solve_power(nobs1=n_available,
                                          alpha=alpha,
                                          power=power,
                                          ratio=1,
                                          alternative='two-sided')
        
        print(f"Given:")
        print(f"  Available sample size: {n_available} per group")
        print(f"  Significance level: {alpha}")
        print(f"  Desired power: {power}")
        
        print(f"\n✅ Minimum detectable effect size: {effect_size:.3f}")
        
        if effect_size < 0.2:
            print("   Can detect even small effects!")
        elif effect_size < 0.5:
            print("   Can detect medium and large effects")
        elif effect_size < 0.8:
            print("   Can only detect large effects")
        else:
            print("   ⚠️ Sample too small for reasonable power")

# Run power analysis
power_analysis = PowerAnalysis()
power_analysis.sample_size_calculation()
power_analysis.minimum_detectable_effect()

Multiple Testing Correction 🔄

The Multiple Comparisons Problem

If you run 20 tests at α = 0.05 and every null hypothesis is true, you expect 1 false positive on average, purely by chance!

Solutions: Bonferroni, Holm, FDR corrections
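The arithmetic behind the problem is short enough to check by hand. This sketch assumes the 20 tests are independent and that every null hypothesis is true; the variable names are ours.

```python
alpha = 0.05
n_tests = 20

# Average number of false positives across the whole family of tests
expected_false_positives = alpha * n_tests

# Family-wise error rate: chance of at least one false positive
fwer = 1 - (1 - alpha) ** n_tests

# Bonferroni: test each hypothesis at alpha / n_tests instead
bonferroni_alpha = alpha / n_tests
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one false positive: {fwer:.1%}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
print(f"Family-wise error rate after Bonferroni: {fwer_bonferroni:.1%}")
```

Without correction the chance of at least one false alarm is well over half; Bonferroni pulls the family-wise rate back under α at the cost of power for each individual test.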

import numpy as np
from statsmodels.stats.multitest import multipletests

class MultipleTestingCorrection:
    """
    Handle multiple comparisons properly
    """
    
    def demonstrate_problem(self):
        """
        Show why we need correction
        """
        print("🔄 MULTIPLE TESTING PROBLEM")
        print("=" * 40)
        
        # Simulate 20 tests with no real effect (all null true)
        np.random.seed(42)
        n_tests = 20
        p_values = np.random.uniform(0, 1, n_tests)
        
        # Sort for display
        p_values = np.sort(p_values)
        
        print(f"Running {n_tests} tests at α = 0.05")
        print("\nUncorrected results:")
        
        significant = p_values < 0.05
        print(f"  Significant tests: {significant.sum()}")
        print(f"  False positive rate: {significant.sum()/n_tests*100:.1f}%")
        
        if significant.any():
            print(f"  Smallest p-value: {p_values[0]:.4f}")
        
        # Apply corrections
        print("\n📊 With Multiple Testing Corrections:")
        
        methods = ['bonferroni', 'holm', 'fdr_bh']
        method_names = ['Bonferroni', 'Holm-Bonferroni', 'Benjamini-Hochberg FDR']
        
        for method, name in zip(methods, method_names):
            reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
            
            print(f"\n{name}:")
            print(f"  Significant after correction: {reject.sum()}")
            print(f"  Adjusted α: {0.05/n_tests if method=='bonferroni' else 'varies'}")
        
        print("\n💡 Insights:")
        print("  • Bonferroni: Most conservative (α/n)")
        print("  • Holm: Slightly less conservative")
        print("  • FDR: Controls false discovery rate")
        print("  • Use FDR for exploratory analyses")

# Demonstrate
mtc = MultipleTestingCorrection()
mtc.demonstrate_problem()

Practical Decision Framework 🎯

Your Hypothesis Testing Checklist

✅ Define H₀ and H₁ clearly
✅ Choose significance level α (usually 0.05)
✅ Check test assumptions (normality, independence)
✅ Select appropriate test
✅ Calculate test statistic and p-value
✅ Make decision (reject or fail to reject)
✅ Report effect size and confidence interval
✅ Consider practical significance
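As one possible way to tie the checklist together, the sketch below bundles the test, the decision, the effect size, and the confidence interval into a single report. It assumes two independent groups and uses Welch's t-test; `compare_two_groups` and the simulated scores are hypothetical names and data introduced for illustration.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Welch t-test plus effect size and CI for the mean difference (b - a)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    diff = b.mean() - a.mean()

    # Squared standard errors of each group mean
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    se = np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

    t_stat, p_value = stats.ttest_ind(b, a, equal_var=False)

    # Confidence interval for the difference at level 1 - alpha
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    ci = (diff - t_crit * se, diff + t_crit * se)

    # Cohen's d using the pooled sample standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))

    return {"t": t_stat, "p": p_value, "diff": diff, "ci": ci,
            "cohens_d": diff / pooled_sd, "reject_h0": bool(p_value < alpha)}

# Simulated scores (illustrative only)
rng = np.random.default_rng(7)
traditional = rng.normal(75, 10, 30)
new_method = rng.normal(85, 10, 30)

result = compare_two_groups(traditional, new_method)
for key, value in result.items():
    print(f"{key}: {value}")
```

Reporting the difference, its interval, and Cohen's d next to the p-value lets a reader judge practical significance, not just statistical significance.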

Common Pitfalls and Best Practices 🚧

print("⚠️ COMMON PITFALLS TO AVOID")
print("=" * 40)

# Pitfall 1: P-hacking
print("\n❌ P-hacking (Data Dredging)")
print("Testing multiple hypotheses until you find p < 0.05")
print("✅ Solution: Pre-register hypotheses, use corrections")

# Pitfall 2: Ignoring effect size
print("\n❌ Focusing only on p-values")
print("Statistical significance ≠ Practical significance")
print("✅ Solution: Always report effect sizes and CIs")

# Pitfall 3: Violating assumptions
print("\n❌ Using parametric tests on non-normal data")
print("✅ Solution: Check assumptions, use non-parametric alternatives")

# Pitfall 4: Misinterpreting p-values
print("\n❌ 'p = 0.04 means 4% chance H₀ is true'")
print("✅ Correct: 'If H₀ is true, there is a 4% chance of data at least this extreme'")

# Pitfall 5: Publication bias
print("\n❌ Only publishing significant results")
print("✅ Solution: Report all results, including non-significant")

# Best Practices
print("\n✨ BEST PRACTICES")
print("=" * 40)
print("1. Plan sample size with power analysis")
print("2. State hypotheses before collecting data")
print("3. Report exact p-values (not just p < 0.05)")
print("4. Include confidence intervals")
print("5. Consider multiple testing corrections")
print("6. Replicate important findings")
print("7. Share data and code for transparency")

Summary: Your Hypothesis Testing Toolkit ✅

🎯 Key Takeaways:

Null Hypothesis (H₀): The default assumption of no effect
P-value: Probability of data at least as extreme as observed, assuming H₀ is true
Type I Error: False positive (α)
Type II Error: False negative (β)
Power: Probability of detecting true effect (1-β)
Effect Size: Magnitude of the difference
Always report: Test used, p-value, effect size, CI

# Quick Reference Card - Hypothesis Testing
from scipy import stats
import numpy as np

# ONE-SAMPLE T-TEST
# Test if population mean equals a value
data = np.array([...])
null_mean = 100
t_stat, p_value = stats.ttest_1samp(data, null_mean)

# TWO-SAMPLE T-TEST
# Compare means of two independent groups
group1 = np.array([...])
group2 = np.array([...])
t_stat, p_value = stats.ttest_ind(group1, group2)

# PAIRED T-TEST
# Compare paired observations (before/after)
before = np.array([...])
after = np.array([...])
t_stat, p_value = stats.ttest_rel(before, after)

# CHI-SQUARE TEST
# Test independence of categorical variables
contingency_table = [[10, 20], [30, 40]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA
# Compare means of 3+ groups
group1, group2, group3 = [...], [...], [...]
f_stat, p_value = stats.f_oneway(group1, group2, group3)

# NON-PARAMETRIC ALTERNATIVES
# Mann-Whitney U (alternative to two-sample t)
u_stat, p_value = stats.mannwhitneyu(group1, group2)

# Wilcoxon signed-rank (alternative to paired t)
w_stat, p_value = stats.wilcoxon(before, after)

# EFFECT SIZE CALCULATIONS
# Cohen's d
cohens_d = (mean1 - mean2) / pooled_std

# CONFIDENCE INTERVALS
# 95% CI for mean
from scipy import stats
ci = stats.t.interval(0.95, len(data)-1, 
                      loc=np.mean(data), 
                      scale=stats.sem(data))

print("🔬 Master hypothesis testing for confident decisions!")