
🔬 Hypothesis Testing: Making Data-Driven Decisions

Imagine you're a judge in a courtroom ⚖️. The defendant is "innocent until proven guilty" - that's your null hypothesis. You need strong evidence to convict (reject the null). This is hypothesis testing! It's the scientific method for data, helping us make decisions when we can't be 100% certain. From medicine to marketing, hypothesis testing separates gut feelings from statistical facts.

The Foundation: Understanding Hypotheses 🎯

The Cookie Factory Analogy 🍪
A cookie factory claims their cookies weigh 50g on average. You're the quality inspector. Your null hypothesis (H₀) is "the cookies DO weigh 50g." Your alternative hypothesis (H₁) is "the cookies DON'T weigh 50g." You take a sample of cookies and weigh them. If they're way off 50g, you reject the factory's claim. But how far is "way off"? That's what hypothesis testing tells us!
graph TB
    A[Research Question] --> B[Formulate Hypotheses]
    B --> C[Null Hypothesis H₀]
    B --> D[Alternative Hypothesis H₁]
    C --> E[Collect Data]
    D --> E
    E --> F[Calculate Test Statistic]
    F --> G[Find P-value]
    G --> H{P-value < α?}
    H -->|Yes| I[Reject H₀]
    H -->|No| J[Fail to Reject H₀]
    I --> K[Evidence supports H₁]
    J --> L[Insufficient evidence]
    style A fill:#667eea
    style I fill:#10b981
    style J fill:#fbbf24
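As a rough illustration of the cookie inspection, the sketch below runs a one-sample t-test of H₀: μ = 50 g. The cookie weights are simulated, and the sample size, true mean, and spread are assumptions made up for this example. It also recomputes the t-statistic by hand to show that "how far is way off" is measured in standard errors.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 25 simulated cookie weights in grams (not real data)
rng = np.random.default_rng(0)
cookie_weights = rng.normal(loc=51.5, scale=2.0, size=25)

claimed_mean = 50.0  # H0: the mean weight is 50 g
alpha = 0.05         # significance level

# Two-sided one-sample t-test of H0: mu = 50 against H1: mu != 50
t_stat, p_value = stats.ttest_1samp(cookie_weights, popmean=claimed_mean)

# The same statistic by hand: distance from 50 g in standard-error units
standard_error = cookie_weights.std(ddof=1) / np.sqrt(len(cookie_weights))
manual_t = (cookie_weights.mean() - claimed_mean) / standard_error

reject_null = p_value < alpha
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, reject H0: {reject_null}")
```

The decision rule is the last step of the flowchart: reject H₀ only when the p-value falls below α.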

H₀: Null Hypothesis

The "status quo" or "no effect" assumption. Examples:

H₁: Alternative Hypothesis

What you're trying to prove. Can be:

Types of Errors: The Risk Trade-off ⚠️

In hypothesis testing, we can make two types of mistakes. Think of it like a smoke detector: it can fail by alarming when there's no fire (Type I) or by staying silent when there IS a fire (Type II).

Type I Error (α)

False Positive

🚨 Rejecting H₀ when it's actually TRUE

Example: Convicting an innocent person

Probability: α (significance level)

Typically set at 0.05 (5%)

Type II Error (β)

False Negative

😴 Failing to reject H₀ when it's actually FALSE

Example: Acquitting a guilty person

Probability: β

Related to Power: 1 - β
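The two error rates can be estimated by simulation. The sketch below is a minimal illustration under assumed settings (normal data, n = 30, a true mean of 0.5 when H₀ is false); none of these numbers come from the text above. When H₀ is true, the fraction of rejections estimates α; when H₀ is false, it estimates power, and one minus that is β.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_sims, n = 2000, 30  # assumed number of simulated studies and sample size

def rejection_rate(true_mean):
    """Fraction of simulated samples in which H0: mu = 0 is rejected."""
    samples = rng.normal(loc=true_mean, scale=1.0, size=(n_sims, n))
    _, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)
    return np.mean(p_values < alpha)

# H0 is true here, so every rejection is a false positive (Type I error)
type_i_rate = rejection_rate(true_mean=0.0)

# H0 is false here, so every rejection is a correct detection
power = rejection_rate(true_mean=0.5)
type_ii_rate = 1 - power  # misses are Type II errors

print(f"Estimated Type I rate: {type_i_rate:.3f} (target alpha = {alpha})")
print(f"Estimated power: {power:.3f}, Type II rate: {type_ii_rate:.3f}")
```

Lowering α in this sketch reduces the Type I rate but also lowers power, which is the trade-off the smoke detector analogy describes.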

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Set style for visualizations
sns.set_style("whitegrid")
np.random.seed(42)

class HypothesisTestingFramework:
    """
    Your complete guide to hypothesis testing!
    Master the art of statistical inference πŸ”¬
    """
    
    def __init__(self):
        self.alpha = 0.05  # Standard significance level
        
    def understand_p_values(self):
        """
        Demystifying p-values: What they really mean
        """
        print("🎯 UNDERSTANDING P-VALUES")
        print("=" * 60)
        
        print("📊 What is a p-value?")
        print("The probability of observing data at least as extreme as what we got,")
        print("assuming the null hypothesis is TRUE.")
        print("\n⚠️ Common Misconceptions:")
        print("❌ P-value is NOT the probability that H₀ is true")
        print("❌ P-value is NOT the probability of making an error")
        print("❌ 1-p is NOT the probability that H₁ is true")
        
        print("\n📏 P-value Interpretation Scale:")
        print("-" * 40)
        
        p_values = [0.001, 0.01, 0.05, 0.10, 0.50]
        interpretations = [
            "Very strong evidence against H₀",
            "Strong evidence against H₀",
            "Moderate evidence against H₀",
            "Weak evidence against H₀",
            "No evidence against H₀"
        ]
        
        for p, interp in zip(p_values, interpretations):
            symbol = "⭐⭐⭐" if p <= 0.001 else "⭐⭐" if p <= 0.01 else "⭐" if p <= 0.05 else "○"
            print(f"p = {p:.3f}: {interp} {symbol}")
        
        # Simulation example
        print("\n🔬 Simulation: Coin Fairness Test")
        print("-" * 40)
        
        # Test if a coin is fair (H₀: p = 0.5)
        n_flips = 100
        observed_heads = 60  # We observed 60 heads
        
        # Exact two-sided binomial test. stats.binomtest (SciPy >= 1.7) replaces
        # stats.binom_test, which was removed in SciPy 1.12.
        p_value = stats.binomtest(observed_heads, n_flips, p=0.5, alternative='two-sided').pvalue
        
        print(f"Flips: {n_flips}, Heads: {observed_heads}")
        print(f"H₀: Coin is fair (p = 0.5)")
        print(f"H₁: Coin is not fair (p ≠ 0.5)")
        print(f"P-value: {p_value:.4f}")
        
        if p_value < self.alpha:
            print(f"✅ Reject H₀: Evidence suggests coin is unfair")
        else:
            print(f"❌ Fail to reject H₀: Insufficient evidence of unfairness")
    
    def one_sample_t_test(self):
        """
        One-sample t-test: Testing a population mean
        """
        print("\n📈 ONE-SAMPLE T-TEST")
        print("=" * 60)
        
        # Scenario: Coffee shop claims average wait time is 3 minutes
        claimed_mean = 3.0
        
        # Sample data (actual wait times in minutes)
        wait_times = np.array([2.5, 3.2, 4.1, 2.8, 3.5, 3.9, 2.7, 3.3, 4.2, 3.6,
                               2.9, 3.4, 3.8, 2.6, 3.7, 4.0, 3.1, 2.8, 3.5, 3.2])
        
        print("☕ Coffee Shop Wait Time Analysis")
        print(f"Claimed average: {claimed_mean} minutes")
        print(f"Sample size: {len(wait_times)}")
        print(f"Sample mean: {wait_times.mean():.2f} minutes")
        print(f"Sample std: {wait_times.std(ddof=1):.2f} minutes")
        
        # Perform t-test
        t_stat, p_value = stats.ttest_1samp(wait_times, claimed_mean)
        
        print(f"\nTest Results:")
        print(f"t-statistic: {t_stat:.4f}")
        print(f"p-value: {p_value:.4f}")
        
        # Calculate confidence interval
        confidence = 0.95
        ci = stats.t.interval(confidence, len(wait_times)-1, 
                             loc=wait_times.mean(), 
                             scale=wait_times.std(ddof=1)/np.sqrt(len(wait_times)))
        
        print(f"\n{confidence*100:.0f}% Confidence Interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print(f"✅ Reject H₀: Wait time is significantly different from {claimed_mean} min")
            print(f"   The coffee shop's claim appears to be false!")
        else:
            print(f"❌ Fail to reject H₀: No significant difference from {claimed_mean} min")
            print(f"   The data are consistent with the claim, though this does not prove it.")
        
        # Effect size (Cohen's d)
        cohens_d = (wait_times.mean() - claimed_mean) / wait_times.std(ddof=1)
        print(f"\n📊 Effect Size (Cohen's d): {cohens_d:.3f}")
        if abs(cohens_d) < 0.2:
            print("   Negligible effect")
        elif abs(cohens_d) < 0.5:
            print("   Small effect")
        elif abs(cohens_d) < 0.8:
            print("   Medium effect")
        else:
            print("   Large effect")
    
    def two_sample_t_test(self):
        """
        Two-sample t-test: Comparing two groups
        """
        print("\n📊 TWO-SAMPLE T-TEST")
        print("=" * 60)
        
        # Scenario: Comparing effectiveness of two teaching methods
        np.random.seed(42)
        
        # Test scores for two groups
        method_a = np.random.normal(75, 10, 30)  # Traditional method
        method_b = np.random.normal(80, 12, 35)  # New method
        
        print("🎓 Teaching Methods Comparison")
        # ddof=1 gives the sample standard deviation, matching the t-test
        print(f"Method A (Traditional): n={len(method_a)}, mean={method_a.mean():.1f}, std={method_a.std(ddof=1):.1f}")
        print(f"Method B (New):         n={len(method_b)}, mean={method_b.mean():.1f}, std={method_b.std(ddof=1):.1f}")
        
        # Test for equal variances (Levene's test)
        levene_stat, levene_p = stats.levene(method_a, method_b)
        print(f"\n📏 Levene's Test for Equal Variances:")
        print(f"   p-value: {levene_p:.4f}")
        
        equal_var = levene_p > 0.05
        print(f"   Variances are {'equal' if equal_var else 'unequal'} (use equal_var={equal_var})")
        
        # Perform t-test
        t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=equal_var)
        
        print(f"\n📊 Two-Sample T-Test Results:")
        print(f"   t-statistic: {t_stat:.4f}")
        print(f"   p-value: {p_value:.4f}")
        
        # Effect size (Cohen's d) using the pooled sample variance (ddof=1)
        pooled_std = np.sqrt(((len(method_a)-1)*method_a.var(ddof=1) + (len(method_b)-1)*method_b.var(ddof=1)) / 
                             (len(method_a) + len(method_b) - 2))
        cohens_d = (method_b.mean() - method_a.mean()) / pooled_std
        
        print(f"   Cohen's d: {cohens_d:.3f}")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant difference between methods")
            if method_b.mean() > method_a.mean():
                print("   The new method appears to be better!")
            else:
                print("   The traditional method appears to be better!")
        else:
            print("❌ Fail to reject H₀: No significant difference between methods")
    
    def paired_t_test(self):
        """
        Paired t-test: Before and after comparisons
        """
        print("\n🔄 PAIRED T-TEST")
        print("=" * 60)
        
        # Scenario: Weight loss program (before and after)
        np.random.seed(42)
        
        n_participants = 20
        before_weight = np.random.normal(180, 20, n_participants)
        # After weight (most lose weight, but with variation)
        weight_loss = np.random.normal(8, 5, n_participants)
        after_weight = before_weight - weight_loss
        
        print("🏃 Weight Loss Program Analysis")
        print(f"Participants: {n_participants}")
        print(f"Before: mean={before_weight.mean():.1f} lbs, std={before_weight.std(ddof=1):.1f}")
        print(f"After:  mean={after_weight.mean():.1f} lbs, std={after_weight.std(ddof=1):.1f}")
        print(f"Average loss: {(before_weight - after_weight).mean():.1f} lbs")
        
        # Perform paired t-test
        t_stat, p_value = stats.ttest_rel(before_weight, after_weight)
        
        print(f"\n📊 Paired T-Test Results:")
        print(f"   t-statistic: {t_stat:.4f}")
        print(f"   p-value: {p_value:.4f}")
        
        # Calculate confidence interval for the difference
        differences = before_weight - after_weight
        ci = stats.t.interval(0.95, len(differences)-1,
                             loc=differences.mean(),
                             scale=differences.std(ddof=1)/np.sqrt(len(differences)))
        
        print(f"\n95% CI for weight loss: [{ci[0]:.1f}, {ci[1]:.1f}] lbs")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant weight loss detected")
            print(f"   The program is effective! Average loss: {differences.mean():.1f} lbs")
        else:
            print("❌ Fail to reject H₀: No significant weight loss")
    
    def chi_square_test(self):
        """
        Chi-square test: Testing categorical relationships
        """
        print("\n🎲 CHI-SQUARE TEST")
        print("=" * 60)
        
        # Scenario: Is customer satisfaction related to age group?
        # Creating contingency table
        data = pd.DataFrame({
            'Age_Group': ['18-30'] * 50 + ['31-50'] * 60 + ['51+'] * 40,
            'Satisfaction': np.random.choice(['Low', 'Medium', 'High'], 150, 
                                           p=[0.2, 0.5, 0.3])
        })
        
        # Add some relationship (older = more satisfied)
        data.loc[(data['Age_Group'] == '51+') & (data['Satisfaction'] == 'Low'), 'Satisfaction'] = 'High'
        
        # Create contingency table
        cont_table = pd.crosstab(data['Age_Group'], data['Satisfaction'])
        
        print("📊 Customer Satisfaction by Age Group")
        print(cont_table)
        
        # Perform chi-square test
        chi2, p_value, dof, expected = stats.chi2_contingency(cont_table)
        
        print(f"\n📈 Chi-Square Test Results:")
        print(f"   χ² statistic: {chi2:.4f}")
        print(f"   p-value: {p_value:.4f}")
        print(f"   Degrees of freedom: {dof}")
        
        # Calculate Cramér's V (effect size)
        n = cont_table.sum().sum()
        min_dim = min(cont_table.shape[0] - 1, cont_table.shape[1] - 1)
        cramers_v = np.sqrt(chi2 / (n * min_dim))
        
        print(f"   Cramér's V: {cramers_v:.3f}")
        
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Satisfaction and age are related")
            print("   There's a significant association between age and satisfaction!")
        else:
            print("❌ Fail to reject H₀: No significant relationship")

# Run demonstrations
testing = HypothesisTestingFramework()
testing.understand_p_values()
testing.one_sample_t_test()
testing.two_sample_t_test()
testing.paired_t_test()
testing.chi_square_test()

The P-Value Scale: Making Decisions 📊

Strength of evidence against H₀:
p < 0.01: Very Strong
0.01 ≤ p < 0.05: Strong
0.05 ≤ p < 0.10: Moderate
p ≥ 0.10: Weak/None

Choosing the Right Test 🎯

Test Type | Use Case | Data Type | Assumptions | Example
One-Sample t-test | Compare sample mean to known value | Continuous | Normal distribution | Is average height 170cm?
Two-Sample t-test | Compare means of two groups | Continuous | Normal, equal variances | Drug A vs Drug B
Paired t-test | Before/after comparisons | Continuous | Normal differences | Weight before/after diet
Chi-Square | Test independence | Categorical | Expected freq ≥ 5 | Gender vs Product preference
ANOVA | Compare 3+ groups | Continuous | Normal, equal variances | Compare 3 teaching methods
Mann-Whitney U | Non-parametric alternative | Ordinal/Continuous | No normality required | Compare ratings

import numpy as np
from scipy import stats
import pandas as pd

class TestSelection:
    """
    Smart test selection based on your data
    """
    
    def select_test(self, data_type, n_groups, paired=False, normal=True):
        """
        Decision tree for test selection
        """
        # Normalize case so 'Continuous' and 'continuous' are treated alike
        data_type = data_type.lower()
        if data_type == 'categorical':
            if n_groups == 2:
                return "Chi-square test or Fisher's exact test"
            else:
                return "Chi-square test"
        
        elif data_type == 'continuous':
            if n_groups == 1:
                if normal:
                    return "One-sample t-test"
                else:
                    return "Wilcoxon signed-rank test"
            
            elif n_groups == 2:
                if paired:
                    if normal:
                        return "Paired t-test"
                    else:
                        return "Wilcoxon signed-rank test"
                else:
                    if normal:
                        return "Two-sample t-test"
                    else:
                        return "Mann-Whitney U test"
            
            else:  # n_groups > 2
                if normal:
                    return "One-way ANOVA"
                else:
                    return "Kruskal-Wallis test"
        
        return "Consult a statistician!"
    
    def normality_tests(self, data):
        """
        Test if data follows normal distribution
        """
        print("🔍 NORMALITY TESTS")
        print("=" * 40)
        
        # Shapiro-Wilk test (best for small samples)
        if len(data) <= 5000:
            stat, p_value = stats.shapiro(data)
            print(f"Shapiro-Wilk Test:")
            print(f"  Statistic: {stat:.4f}")
            print(f"  P-value: {p_value:.4f}")
            
            if p_value > 0.05:
                print("  ✅ Data appears to be normal")
            else:
                print("  ❌ Data is not normally distributed")
        
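        # Kolmogorov-Smirnov test
        # Note: the mean and std are estimated from the same data, so this
        # p-value is only approximate (it tends to be too large)
        stat, p_value = stats.kstest(data, 'norm', 
                                     args=(data.mean(), data.std(ddof=1)))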
        print(f"\nKolmogorov-Smirnov Test:")
        print(f"  Statistic: {stat:.4f}")
        print(f"  P-value: {p_value:.4f}")
        
        # Anderson-Darling test
        result = stats.anderson(data)
        print(f"\nAnderson-Darling Test:")
        print(f"  Statistic: {result.statistic:.4f}")
        print(f"  Critical values: {result.critical_values}")
        
        return p_value > 0.05

# Example usage
selector = TestSelection()

# Test selection examples
print("📋 TEST SELECTION GUIDE")
print("=" * 40)

scenarios = [
    ("Continuous", 1, False, True),
    ("Continuous", 2, False, True),
    ("Continuous", 2, True, True),
    ("Continuous", 3, False, True),
    ("Continuous", 2, False, False),
    ("Categorical", 2, False, None)
]

for data_type, n_groups, paired, normal in scenarios:
    test = selector.select_test(data_type, n_groups, paired, normal)
    print(f"\nData: {data_type}, Groups: {n_groups}, Paired: {paired}, Normal: {normal}")
    print(f"→ Use: {test}")

# Check normality of sample data
print("\n" + "=" * 40)
sample_data = np.random.normal(100, 15, 100)
selector.normality_tests(sample_data)

Power Analysis: Planning Your Study 💪

Statistical Power = 1 - β

The probability of correctly rejecting a false null hypothesis.

graph LR
    A[Power Analysis] --> B[Sample Size]
    A --> C[Effect Size]
    A --> D[Significance Level α]
    A --> E[Power 1-β]
    B --> F[How many subjects?]
    C --> G[How big is the difference?]
    D --> H[Type I error rate]
    E --> I[Type II error rate]
    style A fill:#0284c7
    style E fill:#10b981
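To make the link between sample size, effect size, α, and power concrete, here is a minimal sketch that computes power for a two-sided, equal-sized two-sample t-test directly from the noncentral t distribution. The helper name `two_sample_power` is hypothetical, and the sketch assumes normal data with equal variances in both groups.

```python
import numpy as np
from scipy import stats

def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided two-sample t-test with equal group sizes."""
    df = 2 * n_per_group - 2
    # Under H1 the t-statistic follows a noncentral t with this shift
    noncentrality = effect_size * np.sqrt(n_per_group / 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Probability that |T| exceeds the critical value under H1
    upper_tail = stats.nct.sf(t_crit, df, noncentrality)
    lower_tail = stats.nct.cdf(-t_crit, df, noncentrality)
    return upper_tail + lower_tail

power_64 = two_sample_power(effect_size=0.5, n_per_group=64)
power_100 = two_sample_power(effect_size=0.5, n_per_group=100)
print(f"Power with 64 per group:  {power_64:.3f}")
print(f"Power with 100 per group: {power_100:.3f}")
```

For a medium effect (d = 0.5) at α = 0.05, roughly 64 subjects per group gives power close to 0.8, and adding subjects pushes it higher.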
import numpy as np
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt

class PowerAnalysis:
    """
    Plan your study with proper power analysis
    """
    
    def sample_size_calculation(self):
        """
        Calculate required sample size
        """
        print("📊 SAMPLE SIZE CALCULATION")
        print("=" * 40)
        
        # Parameters
        effect_size = 0.5  # Medium effect
        alpha = 0.05
        power = 0.8
        
        # TTestIndPower models the two-sample (independent) t-test and accepts
        # nobs1 and ratio; TTestPower is the one-sample/paired version and does not
        analysis = TTestIndPower()
        
        # Solve for the per-group sample size (nobs1 is left unspecified)
        n = analysis.solve_power(effect_size=effect_size,
                               alpha=alpha,
                               power=power,
                               ratio=1,  # Equal group sizes
                               alternative='two-sided')
        
        print(f"Parameters:")
        print(f"  Effect size (Cohen's d): {effect_size}")
        print(f"  Significance level (α): {alpha}")
        print(f"  Desired power (1-β): {power}")
        
        print(f"\n✅ Required sample size per group: {np.ceil(n):.0f}")
        print(f"   Total sample size: {2 * np.ceil(n):.0f}")
        
        # Show how power changes with sample size
        print("\n📈 Power vs Sample Size:")
        sample_sizes = [10, 20, 30, 50, 100, 200]
        
        # Separate names avoid overwriting n and power from above
        for n_per_group in sample_sizes:
            achieved_power = analysis.solve_power(effect_size=effect_size,
                                       alpha=alpha,
                                       nobs1=n_per_group,
                                       ratio=1,
                                       alternative='two-sided')
            print(f"  n={n_per_group:3d}: Power = {achieved_power:.3f}")
        
        # Effect size guidelines
        print("\n📏 Cohen's d Effect Size Guidelines:")
        print("  Small:  d = 0.2")
        print("  Medium: d = 0.5")
        print("  Large:  d = 0.8")
    
    def minimum_detectable_effect(self):
        """
        What effect size can we detect with our sample?
        """
        print("\n🔍 MINIMUM DETECTABLE EFFECT")
        print("=" * 40)
        
        # Given constraints
        n_available = 50
        alpha = 0.05
        power = 0.8
        
        # Two-sample power model; effect_size is left unspecified and solved for
        analysis = TTestIndPower()
        effect_size = analysis.solve_power(nobs1=n_available,
                                          alpha=alpha,
                                          power=power,
                                          ratio=1,
                                          alternative='two-sided')
        
        print(f"Given:")
        print(f"  Available sample size: {n_available} per group")
        print(f"  Significance level: {alpha}")
        print(f"  Desired power: {power}")
        
        print(f"\n✅ Minimum detectable effect size: {effect_size:.3f}")
        
        if effect_size < 0.2:
            print("   Can detect even small effects!")
        elif effect_size < 0.5:
            print("   Can detect medium and large effects")
        elif effect_size < 0.8:
            print("   Can only detect large effects")
        else:
            print("   ⚠️ Sample too small for reasonable power")

# Run power analysis
power_analysis = PowerAnalysis()
power_analysis.sample_size_calculation()
power_analysis.minimum_detectable_effect()

Multiple Testing Correction 🔄

The Multiple Comparisons Problem

If you run 20 tests at α = 0.05 and every null hypothesis is true, you expect about 1 false positive by chance alone!

Solutions: Bonferroni, Holm, FDR corrections
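The arithmetic behind the problem is short enough to check directly. This sketch assumes the 20 tests are independent and that every null hypothesis is true; it shows the expected number of false positives, the chance of at least one (the family-wise error rate), and what a Bonferroni threshold does to that chance.

```python
alpha = 0.05
n_tests = 20

# Expected number of false positives when all nulls are true
expected_false_positives = alpha * n_tests

# Family-wise error rate: chance of at least one false positive
# (assumes the tests are independent)
fwer_uncorrected = 1 - (1 - alpha) ** n_tests

# Bonferroni tests each hypothesis at alpha / n_tests instead
bonferroni_alpha = alpha / n_tests
fwer_bonferroni = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"FWER without correction:  {fwer_uncorrected:.3f}")
print(f"Bonferroni per-test alpha: {bonferroni_alpha:.4f}")
print(f"FWER with Bonferroni:     {fwer_bonferroni:.3f}")
```

Without correction the chance of at least one false positive is well above one half; with Bonferroni it drops back below α.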

import numpy as np
from statsmodels.stats.multitest import multipletests

class MultipleTestingCorrection:
    """
    Handle multiple comparisons properly
    """
    
    def demonstrate_problem(self):
        """
        Show why we need correction
        """
        print("🔄 MULTIPLE TESTING PROBLEM")
        print("=" * 40)
        
        # Simulate 20 tests with no real effect (all null true)
        np.random.seed(42)
        n_tests = 20
        p_values = np.random.uniform(0, 1, n_tests)
        
        # Sort for display
        p_values = np.sort(p_values)
        
        print(f"Running {n_tests} tests at α = 0.05")
        print("\nUncorrected results:")
        
        significant = p_values < 0.05
        print(f"  Significant tests: {significant.sum()}")
        print(f"  False positive rate: {significant.sum()/n_tests*100:.1f}%")
        
        if significant.any():
            print(f"  Smallest p-value: {p_values[0]:.4f}")
        
        # Apply corrections
        print("\n📊 With Multiple Testing Corrections:")
        
        methods = ['bonferroni', 'holm', 'fdr_bh']
        method_names = ['Bonferroni', 'Holm-Bonferroni', 'Benjamini-Hochberg FDR']
        
        for method, name in zip(methods, method_names):
            reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
            
            print(f"\n{name}:")
            print(f"  Significant after correction: {reject.sum()}")
            print(f"  Adjusted α: {0.05/n_tests if method=='bonferroni' else 'varies'}")
        
        print("\n💡 Insights:")
        print("  • Bonferroni: Most conservative (α/n)")
        print("  • Holm: Slightly less conservative")
        print("  • FDR: Controls false discovery rate")
        print("  • Use FDR for exploratory analyses")

# Demonstrate
mtc = MultipleTestingCorrection()
mtc.demonstrate_problem()

Practical Decision Framework 🎯

Your Hypothesis Testing Checklist

  1. ✅ Define H₀ and H₁ clearly
  2. ✅ Choose significance level α (usually 0.05)
  3. ✅ Check test assumptions (normality, independence)
  4. ✅ Select appropriate test
  5. ✅ Calculate test statistic and p-value
  6. ✅ Make decision (reject or fail to reject)
  7. ✅ Report effect size and confidence interval
  8. ✅ Consider practical significance
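The checklist can be sketched end to end on a small example. The measurements below are hypothetical values invented for illustration, and Welch's t-test is one reasonable choice for step 4 rather than the only one. The confidence interval uses the same Welch degrees of freedom as the test, so the interval and the p-value lead to the same decision.

```python
import numpy as np
from scipy import stats

# Step 1: H0: both groups have the same mean; H1: the means differ
# Hypothetical measurements (illustrative only, not real data)
control = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9])
treatment = np.array([12.9, 12.4, 13.1, 12.7, 12.2, 13.0, 12.8, 12.6])

alpha = 0.05  # Step 2: significance level

# Step 3: rough normality check on each group (Shapiro-Wilk)
_, p_norm_control = stats.shapiro(control)
_, p_norm_treatment = stats.shapiro(treatment)

# Steps 4-5: Welch's t-test, which does not assume equal variances
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Step 6: decision
decision = "reject H0" if p_value < alpha else "fail to reject H0"

# Step 7: effect size (Cohen's d) and 95% CI for the mean difference
n_t, n_c = len(treatment), len(control)
var_t, var_c = treatment.var(ddof=1), control.var(ddof=1)
diff = treatment.mean() - control.mean()
pooled_std = np.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
cohens_d = diff / pooled_std

se = np.sqrt(var_t / n_t + var_c / n_c)
# Welch-Satterthwaite degrees of freedom
df = se**4 / ((var_t / n_t) ** 2 / (n_t - 1) + (var_c / n_c) ** 2 / (n_c - 1))
margin = stats.t.ppf(1 - alpha / 2, df) * se
ci_low, ci_high = diff - margin, diff + margin

print(f"p = {p_value:.4f} -> {decision}")
print(f"Cohen's d = {cohens_d:.2f}, 95% CI for difference: [{ci_low:.2f}, {ci_high:.2f}]")
```

Step 8 is a judgment call that code cannot make: whether a difference of this size matters depends on the units and the cost of acting on it.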

Common Pitfalls and Best Practices 🚧

print("⚠️ COMMON PITFALLS TO AVOID")
print("=" * 40)

# Pitfall 1: P-hacking
print("\n❌ P-hacking (Data Dredging)")
print("Testing multiple hypotheses until you find p < 0.05")
print("✅ Solution: Pre-register hypotheses, use corrections")

# Pitfall 2: Ignoring effect size
print("\n❌ Focusing only on p-values")
print("Statistical significance ≠ Practical significance")
print("✅ Solution: Always report effect sizes and CIs")

# Pitfall 3: Violating assumptions
print("\n❌ Using parametric tests on non-normal data")
print("✅ Solution: Check assumptions, use non-parametric alternatives")

# Pitfall 4: Misinterpreting p-values
print("\n❌ 'p = 0.04 means 4% chance H₀ is true'")
print("✅ Correct: 'If H₀ is true, 4% chance of this extreme data'")

# Pitfall 5: Publication bias
print("\n❌ Only publishing significant results")
print("✅ Solution: Report all results, including non-significant")

# Best Practices
print("\n✨ BEST PRACTICES")
print("=" * 40)
print("1. Plan sample size with power analysis")
print("2. State hypotheses before collecting data")
print("3. Report exact p-values (not just p < 0.05)")
print("4. Include confidence intervals")
print("5. Consider multiple testing corrections")
print("6. Replicate important findings")
print("7. Share data and code for transparency")
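Pitfall 2 is easy to reproduce. In this sketch the two simulated groups differ by a tiny amount (an assumed true difference of 0.3 on a scale with standard deviation 15), yet with a very large sample the t-test reports an extremely small p-value. All numbers here are made up for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 500_000  # very large sample per group (assumed for illustration)

group_a = rng.normal(100.0, 15.0, n)
group_b = rng.normal(100.3, 15.0, n)  # true difference of only 0.3 points

# Welch's t-test: with this much data even a tiny difference is detectable
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

# Cohen's d with equal group sizes: difference in pooled-SD units
pooled_std = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_std

print(f"p-value: {p_value:.2e}")
print(f"Cohen's d: {cohens_d:.3f}")
```

The p-value says the difference is unlikely to be zero; the effect size says it is far below even the "small" benchmark of d = 0.2, which is why both should be reported.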

Summary: Your Hypothesis Testing Toolkit ✅

🎯 Key Takeaways:

# Quick Reference Card - Hypothesis Testing
from scipy import stats
import numpy as np

# ONE-SAMPLE T-TEST
# Test if population mean equals a value
data = np.array([...])
null_mean = 100
t_stat, p_value = stats.ttest_1samp(data, null_mean)

# TWO-SAMPLE T-TEST
# Compare means of two independent groups
group1 = np.array([...])
group2 = np.array([...])
t_stat, p_value = stats.ttest_ind(group1, group2)

# PAIRED T-TEST
# Compare paired observations (before/after)
before = np.array([...])
after = np.array([...])
t_stat, p_value = stats.ttest_rel(before, after)

# CHI-SQUARE TEST
# Test independence of categorical variables
contingency_table = [[10, 20], [30, 40]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# ANOVA
# Compare means of 3+ groups
group1, group2, group3 = [...], [...], [...]
f_stat, p_value = stats.f_oneway(group1, group2, group3)

# NON-PARAMETRIC ALTERNATIVES
# Mann-Whitney U (alternative to two-sample t)
u_stat, p_value = stats.mannwhitneyu(group1, group2)

# Wilcoxon signed-rank (alternative to paired t)
w_stat, p_value = stats.wilcoxon(before, after)

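# EFFECT SIZE CALCULATIONS
# Cohen's d using the pooled sample standard deviation (ddof=1)
n1, n2 = len(group1), len(group2)
pooled_std = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                      (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std

# CONFIDENCE INTERVALS
# 95% CI for mean
ci = stats.t.interval(0.95, len(data)-1, 
                      loc=np.mean(data), 
                      scale=stats.sem(data))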

print("🔬 Master hypothesis testing for confident decisions!")