🔬 Hypothesis Testing: Making Data-Driven Decisions
Imagine you're a judge in a courtroom ⚖️. The defendant is "innocent until proven guilty": that's your null hypothesis. You need strong evidence to convict (reject the null). This is hypothesis testing! It's the scientific method for data, helping us make decisions when we can't be 100% certain. From medicine to marketing, hypothesis testing separates gut feelings from statistical evidence.
The Foundation: Understanding Hypotheses 🎯
A cookie factory claims their cookies weigh 50g on average. You're the quality inspector. Your null hypothesis (H₀) is "the cookies DO weigh 50g." Your alternative hypothesis (H₁) is "the cookies DON'T weigh 50g." You take a sample of cookies and weigh them. If they're way off 50g, you reject the factory's claim. But how far is "way off"? That's what hypothesis testing tells us!
H₀: Null Hypothesis
The "status quo" or "no effect" assumption. Examples:
- The new drug has no effect
- There is no difference between groups
- The correlation is zero
- The mean equals a specific value
H₁: Alternative Hypothesis
What you're trying to find evidence for. Can be:
- Two-tailed: μ ≠ μ₀ (different from)
- Left-tailed: μ < μ₀ (less than)
- Right-tailed: μ > μ₀ (greater than)
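To make the three forms of the alternative concrete, here is a minimal sketch built on the cookie example. The weights are made-up illustration values, not real factory data, and the sketch assumes SciPy 1.6 or later, where `ttest_1samp` accepts an `alternative` argument.

```python
import numpy as np
from scipy import stats

# Hypothetical cookie weights in grams (illustrative values only)
weights = np.array([48.2, 49.5, 47.8, 50.1, 48.9, 49.2, 47.5, 48.8, 49.9, 48.4])
claimed_mean = 50.0  # H0: mu = 50 g

# Run the same data through each form of the alternative hypothesis
results = {}
for alternative in ("two-sided", "less", "greater"):
    res = stats.ttest_1samp(weights, claimed_mean, alternative=alternative)
    results[alternative] = res.pvalue
    print(f"H1 ({alternative:>9}): t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

The t-statistic is identical in all three runs; only the tail area used for the p-value changes. Pick the direction before looking at the data, otherwise the one-tailed p-value is misleadingly small.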
Types of Errors: The Risk Trade-off ⚠️
In hypothesis testing, we can make two types of mistakes. Think of it like a smoke detector: it can fail by alarming when there's no fire (Type I) or by staying silent when there IS a fire (Type II).
Type I Error (α)
False Positive
Rejecting H₀ when it's actually TRUE
Example: Convicting an innocent person
Probability: α (significance level)
Typically set at 0.05 (5%)
Type II Error (β)
False Negative
Failing to reject H₀ when it's actually FALSE
Example: Acquitting a guilty person
Probability: β
Related to Power: 1 - β
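One way to see what α means in practice is to simulate many experiments in which H₀ really is true and count how often a test rejects it anyway. The sketch below is an illustration under assumed settings (normal data, 2,000 simulated experiments, 30 observations each), not part of the original walkthrough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2000
sample_size = 30

# H0 is true in every experiment: the population mean really is 50
samples = rng.normal(loc=50, scale=5, size=(n_experiments, sample_size))

# One t-test per row; each returns a p-value for H0: mu = 50
p_values = stats.ttest_1samp(samples, popmean=50, axis=1).pvalue

# Every rejection here is a Type I error, because H0 holds by construction
type_i_rate = np.mean(p_values < alpha)
print(f"Share of true nulls rejected: {type_i_rate:.3f} (expected about {alpha})")
```

The rejection rate should land close to 0.05: α is the long-run false-alarm rate you agree to tolerate when the null is true.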
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for visualizations
sns.set_style("whitegrid")
np.random.seed(42)
class HypothesisTestingFramework:
    """
    Your complete guide to hypothesis testing!
    Master the art of statistical inference 🔬
    """

    def __init__(self):
        self.alpha = 0.05  # Standard significance level

    def understand_p_values(self):
        """
        Demystifying p-values: What they really mean
        """
        print("🎯 UNDERSTANDING P-VALUES")
        print("=" * 60)
        print("What is a p-value?")
        print("The probability of observing data at least as extreme as what we got,")
        print("assuming the null hypothesis is TRUE.")
        print("\n⚠️ Common Misconceptions:")
        print("❌ P-value is NOT the probability that H₀ is true")
        print("❌ P-value is NOT the probability of making an error")
        print("❌ 1-p is NOT the probability that H₁ is true")
        print("\nP-value Interpretation Scale:")
        print("-" * 40)
        p_values = [0.001, 0.01, 0.05, 0.10, 0.50]
        interpretations = [
            "Very strong evidence against H₀",
            "Strong evidence against H₀",
            "Moderate evidence against H₀",
            "Weak evidence against H₀",
            "No evidence against H₀"
        ]
        for p, interp in zip(p_values, interpretations):
            # Conventional significance stars ("ns" = not significant at 0.05)
            symbol = "***" if p <= 0.001 else "**" if p <= 0.01 else "*" if p <= 0.05 else "ns"
            print(f"p = {p:.3f}: {interp} {symbol}")
        # Simulation example
        print("\n🔬 Simulation: Coin Fairness Test")
        print("-" * 40)
        # Test if a coin is fair (H₀: p = 0.5)
        n_flips = 100
        observed_heads = 60  # We observed 60 heads
        # Exact test using binomial
        # (stats.binom_test was removed in SciPy 1.12; binomtest exists since 1.7)
        p_value = stats.binomtest(observed_heads, n_flips, p=0.5,
                                  alternative='two-sided').pvalue
        print(f"Flips: {n_flips}, Heads: {observed_heads}")
        print("H₀: Coin is fair (p = 0.5)")
        print("H₁: Coin is not fair (p ≠ 0.5)")
        print(f"P-value: {p_value:.4f}")
        if p_value < self.alpha:
            print("✅ Reject H₀: Evidence suggests coin is unfair")
        else:
            print("❌ Fail to reject H₀: Insufficient evidence of unfairness")

    def one_sample_t_test(self):
        """
        One-sample t-test: Testing a population mean
        """
        print("\nONE-SAMPLE T-TEST")
        print("=" * 60)
        # Scenario: Coffee shop claims average wait time is 3 minutes
        claimed_mean = 3.0
        # Sample data (actual wait times in minutes)
        wait_times = np.array([2.5, 3.2, 4.1, 2.8, 3.5, 3.9, 2.7, 3.3, 4.2, 3.6,
                               2.9, 3.4, 3.8, 2.6, 3.7, 4.0, 3.1, 2.8, 3.5, 3.2])
        print("☕ Coffee Shop Wait Time Analysis")
        print(f"Claimed average: {claimed_mean} minutes")
        print(f"Sample size: {len(wait_times)}")
        print(f"Sample mean: {wait_times.mean():.2f} minutes")
        print(f"Sample std: {wait_times.std(ddof=1):.2f} minutes")
        # Perform t-test
        t_stat, p_value = stats.ttest_1samp(wait_times, claimed_mean)
        print("\nTest Results:")
        print(f"t-statistic: {t_stat:.4f}")
        print(f"p-value: {p_value:.4f}")
        # Calculate confidence interval
        confidence = 0.95
        ci = stats.t.interval(confidence, len(wait_times)-1,
                              loc=wait_times.mean(),
                              scale=wait_times.std(ddof=1)/np.sqrt(len(wait_times)))
        print(f"\n{confidence*100:.0f}% Confidence Interval: [{ci[0]:.2f}, {ci[1]:.2f}]")
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print(f"✅ Reject H₀: Wait time is significantly different from {claimed_mean} min")
            print(" The coffee shop's claim appears to be false!")
        else:
            print(f"❌ Fail to reject H₀: No significant difference from {claimed_mean} min")
            # Failing to reject is not proof that the claim is true
            print(" The data are consistent with the coffee shop's claim.")
        # Effect size (Cohen's d)
        cohens_d = (wait_times.mean() - claimed_mean) / wait_times.std(ddof=1)
        print(f"\nEffect Size (Cohen's d): {cohens_d:.3f}")
        if abs(cohens_d) < 0.2:
            print(" Negligible effect")
        elif abs(cohens_d) < 0.5:
            print(" Small effect")
        elif abs(cohens_d) < 0.8:
            print(" Medium effect")
        else:
            print(" Large effect")

    def two_sample_t_test(self):
        """
        Two-sample t-test: Comparing two groups
        """
        print("\nTWO-SAMPLE T-TEST")
        print("=" * 60)
        # Scenario: Comparing effectiveness of two teaching methods
        np.random.seed(42)
        # Test scores for two groups
        method_a = np.random.normal(75, 10, 30)  # Traditional method
        method_b = np.random.normal(80, 12, 35)  # New method
        print("Teaching Methods Comparison")
        # ddof=1 gives the sample standard deviation (NumPy defaults to ddof=0)
        print(f"Method A (Traditional): n={len(method_a)}, mean={method_a.mean():.1f}, std={method_a.std(ddof=1):.1f}")
        print(f"Method B (New): n={len(method_b)}, mean={method_b.mean():.1f}, std={method_b.std(ddof=1):.1f}")
        # Test for equal variances (Levene's test)
        levene_stat, levene_p = stats.levene(method_a, method_b)
        print("\nLevene's Test for Equal Variances:")
        print(f" p-value: {levene_p:.4f}")
        equal_var = levene_p > 0.05
        print(f" Variances are {'equal' if equal_var else 'unequal'} (use equal_var={equal_var})")
        # Perform t-test (equal_var=False runs Welch's t-test)
        t_stat, p_value = stats.ttest_ind(method_a, method_b, equal_var=equal_var)
        print("\nTwo-Sample T-Test Results:")
        print(f" t-statistic: {t_stat:.4f}")
        print(f" p-value: {p_value:.4f}")
        # Effect size (Cohen's d) using the pooled sample variance (ddof=1)
        pooled_std = np.sqrt(((len(method_a)-1)*method_a.var(ddof=1) +
                              (len(method_b)-1)*method_b.var(ddof=1)) /
                             (len(method_a) + len(method_b) - 2))
        cohens_d = (method_b.mean() - method_a.mean()) / pooled_std
        print(f" Cohen's d: {cohens_d:.3f}")
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant difference between methods")
            if method_b.mean() > method_a.mean():
                print(" The new method appears to be better!")
            else:
                print(" The traditional method appears to be better!")
        else:
            print("❌ Fail to reject H₀: No significant difference between methods")

    def paired_t_test(self):
        """
        Paired t-test: Before and after comparisons
        """
        print("\nPAIRED T-TEST")
        print("=" * 60)
        # Scenario: Weight loss program (before and after)
        np.random.seed(42)
        n_participants = 20
        before_weight = np.random.normal(180, 20, n_participants)
        # After weight (most lose weight, but with variation)
        weight_loss = np.random.normal(8, 5, n_participants)
        after_weight = before_weight - weight_loss
        print("Weight Loss Program Analysis")
        print(f"Participants: {n_participants}")
        # ddof=1 gives the sample standard deviation (NumPy defaults to ddof=0)
        print(f"Before: mean={before_weight.mean():.1f} lbs, std={before_weight.std(ddof=1):.1f}")
        print(f"After: mean={after_weight.mean():.1f} lbs, std={after_weight.std(ddof=1):.1f}")
        print(f"Average loss: {(before_weight - after_weight).mean():.1f} lbs")
        # Perform paired t-test
        t_stat, p_value = stats.ttest_rel(before_weight, after_weight)
        print("\nPaired T-Test Results:")
        print(f" t-statistic: {t_stat:.4f}")
        print(f" p-value: {p_value:.4f}")
        # Calculate confidence interval for the difference
        differences = before_weight - after_weight
        ci = stats.t.interval(0.95, len(differences)-1,
                              loc=differences.mean(),
                              scale=differences.std(ddof=1)/np.sqrt(len(differences)))
        print(f"\n95% CI for weight loss: [{ci[0]:.1f}, {ci[1]:.1f}] lbs")
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Significant weight loss detected")
            print(f" The program is effective! Average loss: {differences.mean():.1f} lbs")
        else:
            print("❌ Fail to reject H₀: No significant weight loss")

    def chi_square_test(self):
        """
        Chi-square test: Testing categorical relationships
        """
        print("\nCHI-SQUARE TEST")
        print("=" * 60)
        # Scenario: Is customer satisfaction related to age group?
        # Creating contingency table
        data = pd.DataFrame({
            'Age_Group': ['18-30'] * 50 + ['31-50'] * 60 + ['51+'] * 40,
            'Satisfaction': np.random.choice(['Low', 'Medium', 'High'], 150,
                                             p=[0.2, 0.5, 0.3])
        })
        # Add some relationship (older = more satisfied)
        data.loc[(data['Age_Group'] == '51+') & (data['Satisfaction'] == 'Low'), 'Satisfaction'] = 'High'
        # Create contingency table
        cont_table = pd.crosstab(data['Age_Group'], data['Satisfaction'])
        print("Customer Satisfaction by Age Group")
        print(cont_table)
        # Perform chi-square test
        chi2, p_value, dof, expected = stats.chi2_contingency(cont_table)
        print("\nChi-Square Test Results:")
        print(f" χ² statistic: {chi2:.4f}")
        print(f" p-value: {p_value:.4f}")
        print(f" Degrees of freedom: {dof}")
        # Calculate Cramér's V (effect size)
        n = cont_table.sum().sum()
        min_dim = min(cont_table.shape[0] - 1, cont_table.shape[1] - 1)
        cramers_v = np.sqrt(chi2 / (n * min_dim))
        print(f" Cramér's V: {cramers_v:.3f}")
        # Decision
        print(f"\n🎯 Decision (α = {self.alpha}):")
        if p_value < self.alpha:
            print("✅ Reject H₀: Satisfaction and age are related")
            print(" There's a significant association between age and satisfaction!")
        else:
            print("❌ Fail to reject H₀: No significant relationship")

# Run demonstrations
testing = HypothesisTestingFramework()
testing.understand_p_values()
testing.one_sample_t_test()
testing.two_sample_t_test()
testing.paired_t_test()
testing.chi_square_test()
The P-Value Scale: Making Decisions
- p < 0.01: Very Strong
- p < 0.05: Strong
- p < 0.10: Moderate
- p ≥ 0.10: Weak/None
Choosing the Right Test 🎯
| Test Type | Use Case | Data Type | Assumptions | Example |
|---|---|---|---|---|
| One-Sample t-test | Compare sample mean to known value | Continuous | Normal distribution | Is average height 170cm? |
| Two-Sample t-test | Compare means of two groups | Continuous | Normal, equal variances (Welch's version if unequal) | Drug A vs Drug B |
| Paired t-test | Before/after comparisons | Continuous | Normal differences | Weight before/after diet |
| Chi-Square | Test independence | Categorical | Expected freq ≥ 5 | Gender vs Product preference |
| ANOVA | Compare 3+ groups | Continuous | Normal, equal variances | Compare 3 teaching methods |
| Mann-Whitney U | Non-parametric alternative | Ordinal/Continuous | No normality required | Compare ratings |
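The last two rows of the table are not demonstrated in the walkthrough above, so here is a minimal sketch of both. All data are simulated or made up for illustration: the three score groups and the two sets of star ratings are assumptions, not results from a real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical test scores for three teaching methods (simulated)
method_a = rng.normal(70, 8, 40)
method_b = rng.normal(75, 8, 40)
method_c = rng.normal(82, 8, 40)

# One-way ANOVA: H0 says all three group means are equal
f_stat, p_anova = stats.f_oneway(method_a, method_b, method_c)
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_anova:.4g}")

# Hypothetical 1-5 star ratings for two products (ordinal, clearly not normal)
ratings_x = np.array([3, 4, 2, 5, 4, 3, 4, 5, 3, 4])
ratings_y = np.array([2, 3, 2, 1, 3, 2, 4, 2, 3, 1])

# Mann-Whitney U: rank-based comparison that does not assume normality
u_stat, p_mwu = stats.mannwhitneyu(ratings_x, ratings_y, alternative="two-sided")
print(f"Mann-Whitney U: U = {u_stat:.1f}, p = {p_mwu:.4f}")
```

A significant ANOVA only says that at least one group differs; finding which one takes a follow-up (post hoc) comparison with a multiple testing correction.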
import numpy as np
from scipy import stats
import pandas as pd
class TestSelection:
    """
    Smart test selection based on your data
    """

    def select_test(self, data_type, n_groups, paired=False, normal=True):
        """
        Decision tree for test selection
        """
        # Accept "Continuous" as well as "continuous"
        data_type = data_type.lower()
        if data_type == 'categorical':
            if n_groups == 2:
                return "Chi-square test or Fisher's exact test"
            else:
                return "Chi-square test"
        elif data_type == 'continuous':
            if n_groups == 1:
                if normal:
                    return "One-sample t-test"
                else:
                    return "Wilcoxon signed-rank test"
            elif n_groups == 2:
                if paired:
                    if normal:
                        return "Paired t-test"
                    else:
                        return "Wilcoxon signed-rank test"
                else:
                    if normal:
                        return "Two-sample t-test"
                    else:
                        return "Mann-Whitney U test"
            else:  # n_groups > 2
                if normal:
                    return "One-way ANOVA"
                else:
                    return "Kruskal-Wallis test"
        return "Consult a statistician!"

    def normality_tests(self, data):
        """
        Test if data follows normal distribution
        """
        print("NORMALITY TESTS")
        print("=" * 40)
        # Shapiro-Wilk test (best for small samples)
        if len(data) <= 5000:
            stat, p_value = stats.shapiro(data)
            print("Shapiro-Wilk Test:")
            print(f" Statistic: {stat:.4f}")
            print(f" P-value: {p_value:.4f}")
            if p_value > 0.05:
                print(" ✅ Data appears to be normal")
            else:
                print(" ❌ Data is not normally distributed")
        # Kolmogorov-Smirnov test
        # Note: the mean and std are estimated from the same data, which makes
        # this p-value conservative; treat it as a rough check only
        stat, p_value = stats.kstest(data, 'norm',
                                     args=(data.mean(), data.std(ddof=1)))
        print("\nKolmogorov-Smirnov Test:")
        print(f" Statistic: {stat:.4f}")
        print(f" P-value: {p_value:.4f}")
        # Anderson-Darling test (compare the statistic with the critical values)
        result = stats.anderson(data)
        print("\nAnderson-Darling Test:")
        print(f" Statistic: {result.statistic:.4f}")
        print(f" Critical values: {result.critical_values}")
        # The return value reflects the Kolmogorov-Smirnov p-value computed above
        return p_value > 0.05

# Example usage
selector = TestSelection()
# Test selection examples
print("TEST SELECTION GUIDE")
print("=" * 40)
scenarios = [
    ("Continuous", 1, False, True),
    ("Continuous", 2, False, True),
    ("Continuous", 2, True, True),
    ("Continuous", 3, False, True),
    ("Continuous", 2, False, False),
    ("Categorical", 2, False, None)
]
for data_type, n_groups, paired, normal in scenarios:
    test = selector.select_test(data_type, n_groups, paired, normal)
    print(f"\nData: {data_type}, Groups: {n_groups}, Paired: {paired}, Normal: {normal}")
    print(f"→ Use: {test}")
# Check normality of sample data
print("\n" + "=" * 40)
sample_data = np.random.normal(100, 15, 100)
selector.normality_tests(sample_data)
Power Analysis: Planning Your Study 💪
Statistical Power = 1 - β
The probability of correctly rejecting a false null hypothesis.
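Power can also be estimated by brute force: simulate many studies in which the effect is real and count how often the test detects it. The sketch below is an illustration using only NumPy and SciPy; `simulated_power` is a hypothetical helper name, and the normal data with unit variance are assumptions chosen so the mean shift equals Cohen's d.

```python
import numpy as np
from scipy import stats

def simulated_power(effect_size, n_per_group, alpha=0.05, n_sims=2000, seed=0):
    """Estimate two-sample t-test power by simulation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Group B's mean is shifted by `effect_size` standard deviations, so H0 is false
    group_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    group_b = rng.normal(effect_size, 1.0, size=(n_sims, n_per_group))
    # One test per simulated study (row)
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    # Power = share of studies that correctly reject H0
    return np.mean(p_values < alpha)

power_small_n = simulated_power(0.5, 20)
power_large_n = simulated_power(0.5, 64)
print(f"d = 0.5, n = 20 per group: power is roughly {power_small_n:.2f}")
print(f"d = 0.5, n = 64 per group: power is roughly {power_large_n:.2f}")
```

With a medium effect, 20 participants per group miss the effect most of the time, while about 64 per group reach the conventional 80% target.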
import numpy as np
from statsmodels.stats.power import TTestIndPower
import matplotlib.pyplot as plt
class PowerAnalysis:
    """
    Plan your study with proper power analysis
    """

    def sample_size_calculation(self):
        """
        Calculate required sample size
        """
        print("SAMPLE SIZE CALCULATION")
        print("=" * 40)
        # Parameters
        effect_size = 0.5  # Medium effect
        alpha = 0.05
        power = 0.8
        # TTestIndPower covers two independent samples; TTestPower is for
        # one-sample/paired designs and does not accept nobs1 or ratio
        analysis = TTestIndPower()
        # Solve for the per-group sample size of a two-sample t-test
        n_required = analysis.solve_power(effect_size=effect_size,
                                          alpha=alpha,
                                          power=power,
                                          ratio=1,  # Equal group sizes
                                          alternative='two-sided')
        n_per_group = int(np.ceil(n_required))
        print("Parameters:")
        print(f" Effect size (Cohen's d): {effect_size}")
        print(f" Significance level (α): {alpha}")
        print(f" Desired power (1-β): {power}")
        print(f"\n✅ Required sample size per group: {n_per_group}")
        print(f" Total sample size: {2 * n_per_group}")
        # Show how power changes with sample size
        print("\nPower vs Sample Size:")
        sample_sizes = [10, 20, 30, 50, 100, 200]
        for n in sample_sizes:
            # Leaving power unset makes solve_power solve for it
            achieved_power = analysis.solve_power(effect_size=effect_size,
                                                  alpha=alpha,
                                                  nobs1=n,
                                                  ratio=1,
                                                  alternative='two-sided')
            print(f" n={n:3d}: Power = {achieved_power:.3f}")
        # Effect size guidelines
        print("\nCohen's d Effect Size Guidelines:")
        print(" Small: d = 0.2")
        print(" Medium: d = 0.5")
        print(" Large: d = 0.8")

    def minimum_detectable_effect(self):
        """
        What effect size can we detect with our sample?
        """
        print("\nMINIMUM DETECTABLE EFFECT")
        print("=" * 40)
        # Given constraints
        n_available = 50
        alpha = 0.05
        power = 0.8
        analysis = TTestIndPower()
        # Leaving effect_size unset makes solve_power solve for it
        effect_size = analysis.solve_power(nobs1=n_available,
                                           alpha=alpha,
                                           power=power,
                                           ratio=1,
                                           alternative='two-sided')
        print("Given:")
        print(f" Available sample size: {n_available} per group")
        print(f" Significance level: {alpha}")
        print(f" Desired power: {power}")
        print(f"\n✅ Minimum detectable effect size: {effect_size:.3f}")
        if effect_size < 0.2:
            print(" Can detect even small effects!")
        elif effect_size < 0.5:
            print(" Can detect medium and large effects")
        elif effect_size < 0.8:
            print(" Can only detect large effects")
        else:
            print(" ⚠️ Sample too small for reasonable power")
# Run power analysis
power_analysis = PowerAnalysis()
power_analysis.sample_size_calculation()
power_analysis.minimum_detectable_effect()
Multiple Testing Correction
The Multiple Comparisons Problem
If you run 20 tests at α = 0.05 and every null hypothesis is true, you expect 20 × 0.05 = 1 false positive on average, purely by chance!
Solutions: Bonferroni, Holm, FDR corrections
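The arithmetic behind the problem is short enough to check by hand. This sketch assumes the 20 tests are independent and that every null hypothesis is true; under those assumptions it computes the expected number of false positives, the chance of at least one, and what Bonferroni's α/n threshold does to that chance.

```python
alpha = 0.05
n_tests = 20

# Expected number of false positives when all nulls are true
expected_false_positives = n_tests * alpha

# Family-wise error rate: P(at least one false positive) for independent tests
familywise_error_rate = 1 - (1 - alpha) ** n_tests

# Bonferroni tests each hypothesis at alpha / n_tests instead
bonferroni_alpha = alpha / n_tests
fwer_after_bonferroni = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"Expected false positives: {expected_false_positives:.1f}")
print(f"Chance of at least one: {familywise_error_rate:.1%}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha}")
print(f"Chance of at least one after Bonferroni: {fwer_after_bonferroni:.1%}")
```

Uncorrected, the chance of at least one false alarm is about 64%; with the Bonferroni threshold it drops back below 5%, at the cost of lower power for each individual test.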
import numpy as np
from statsmodels.stats.multitest import multipletests
class MultipleTestingCorrection:
    """
    Handle multiple comparisons properly
    """

    def demonstrate_problem(self):
        """
        Show why we need correction
        """
        print("MULTIPLE TESTING PROBLEM")
        print("=" * 40)
        # Simulate 20 tests with no real effect (all null true);
        # under H₀, p-values are uniformly distributed on [0, 1]
        np.random.seed(42)
        n_tests = 20
        p_values = np.random.uniform(0, 1, n_tests)
        # Sort for display
        p_values = np.sort(p_values)
        print(f"Running {n_tests} tests at α = 0.05")
        print("\nUncorrected results:")
        significant = p_values < 0.05
        print(f" Significant tests: {significant.sum()}")
        print(f" False positive rate: {significant.sum()/n_tests*100:.1f}%")
        if significant.any():
            print(f" Smallest p-value: {p_values[0]:.4f}")
        # Apply corrections
        print("\nWith Multiple Testing Corrections:")
        methods = ['bonferroni', 'holm', 'fdr_bh']
        method_names = ['Bonferroni', 'Holm-Bonferroni', 'Benjamini-Hochberg FDR']
        for method, name in zip(methods, method_names):
            reject, p_adj, _, _ = multipletests(p_values, alpha=0.05, method=method)
            print(f"\n{name}:")
            print(f" Significant after correction: {reject.sum()}")
            print(f" Adjusted α: {0.05/n_tests if method=='bonferroni' else 'varies'}")
        print("\n💡 Insights:")
        print(" • Bonferroni: Most conservative (α/n)")
        print(" • Holm: Slightly less conservative")
        print(" • FDR: Controls false discovery rate")
        print(" • Use FDR for exploratory analyses")
# Demonstrate
mtc = MultipleTestingCorrection()
mtc.demonstrate_problem()
Practical Decision Framework 🎯
Your Hypothesis Testing Checklist
- ✓ Define H₀ and H₁ clearly
- ✓ Choose significance level α (usually 0.05)
- ✓ Check test assumptions (normality, independence)
- ✓ Select appropriate test
- ✓ Calculate test statistic and p-value
- ✓ Make decision (reject or fail to reject)
- ✓ Report effect size and confidence interval
- ✓ Consider practical significance
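The checklist can be sketched as one small function for the common case of two independent groups. This is an illustration under stated assumptions, not a definitive workflow: `compare_two_groups` is a hypothetical helper, the Shapiro-Wilk check stands in for a fuller assumption review, Welch's t-test is used whenever both groups look normal, and the data are simulated.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Hypothetical helper that walks the checklist for two independent samples."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)

    # Check assumptions: normality within each group (Shapiro-Wilk)
    normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha

    # Select the test, then compute the p-value
    if normal:
        test_name = "Welch's t-test"
        p_value = stats.ttest_ind(a, b, equal_var=False).pvalue
    else:
        test_name = "Mann-Whitney U test"
        p_value = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue

    # Effect size: Cohen's d with the pooled sample standard deviation
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    cohens_d = (b.mean() - a.mean()) / pooled_sd

    # CI for the mean difference, using Welch-Satterthwaite degrees of freedom
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    se = np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    margin = stats.t.ppf(1 - alpha / 2, df) * se
    diff = b.mean() - a.mean()

    return {"test": test_name, "p_value": p_value, "reject_h0": p_value < alpha,
            "cohens_d": cohens_d, "ci": (diff - margin, diff + margin)}

# Simulated scores for two groups (illustrative only)
rng = np.random.default_rng(7)
report = compare_two_groups(rng.normal(75, 10, 40), rng.normal(82, 10, 40))
for key, value in report.items():
    print(f"{key}: {value}")
```

The last checklist item stays a human judgment: even with a small p-value, the confidence interval and Cohen's d are what tell you whether the difference is large enough to matter.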
Common Pitfalls and Best Practices
print("⚠️ COMMON PITFALLS TO AVOID")
print("=" * 40)
# Pitfall 1: P-hacking
print("\n❌ P-hacking (Data Dredging)")
print("Testing multiple hypotheses until you find p < 0.05")
print("✅ Solution: Pre-register hypotheses, use corrections")
# Pitfall 2: Ignoring effect size
print("\n❌ Focusing only on p-values")
print("Statistical significance ≠ Practical significance")
print("✅ Solution: Always report effect sizes and CIs")
# Pitfall 3: Violating assumptions
print("\n❌ Using parametric tests on non-normal data")
print("✅ Solution: Check assumptions, use non-parametric alternatives")
# Pitfall 4: Misinterpreting p-values
print("\n❌ 'p = 0.04 means 4% chance H₀ is true'")
print("✅ Correct: 'If H₀ is true, 4% chance of data at least this extreme'")
# Pitfall 5: Publication bias
print("\n❌ Only publishing significant results")
print("✅ Solution: Report all results, including non-significant")
# Best Practices
print("\n✨ BEST PRACTICES")
print("=" * 40)
print("1. Plan sample size with power analysis")
print("2. State hypotheses before collecting data")
print("3. Report exact p-values (not just p < 0.05)")
print("4. Include confidence intervals")
print("5. Consider multiple testing corrections")
print("6. Replicate important findings")
print("7. Share data and code for transparency")
Summary: Your Hypothesis Testing Toolkit
🎯 Key Takeaways:
- Null Hypothesis (H₀): The default assumption of no effect
- P-value: Probability of data at least as extreme as observed, assuming H₀ is true
- Type I Error: False positive (α)
- Type II Error: False negative (β)
- Power: Probability of detecting a true effect (1-β)
- Effect Size: Magnitude of the difference
- Always report: Test used, p-value, effect size, CI
# Quick Reference Card - Hypothesis Testing
from scipy import stats
import numpy as np
# ONE-SAMPLE T-TEST
# Test if population mean equals a value
data = np.array([...])
null_mean = 100
t_stat, p_value = stats.ttest_1samp(data, null_mean)
# TWO-SAMPLE T-TEST
# Compare means of two independent groups
group1 = np.array([...])
group2 = np.array([...])
t_stat, p_value = stats.ttest_ind(group1, group2)
# PAIRED T-TEST
# Compare paired observations (before/after)
before = np.array([...])
after = np.array([...])
t_stat, p_value = stats.ttest_rel(before, after)
# CHI-SQUARE TEST
# Test independence of categorical variables
contingency_table = [[10, 20], [30, 40]]
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
# ANOVA
# Compare means of 3+ groups
group1, group2, group3 = [...], [...], [...]
f_stat, p_value = stats.f_oneway(group1, group2, group3)
# NON-PARAMETRIC ALTERNATIVES
# Mann-Whitney U (alternative to two-sample t)
u_stat, p_value = stats.mannwhitneyu(group1, group2)
# Wilcoxon signed-rank (alternative to paired t)
w_stat, p_value = stats.wilcoxon(before, after)
# EFFECT SIZE CALCULATIONS
# Cohen's d
cohens_d = (mean1 - mean2) / pooled_std
# CONFIDENCE INTERVALS
# 95% CI for mean
ci = stats.t.interval(0.95, len(data)-1,
                      loc=np.mean(data),
                      scale=stats.sem(data))
print("🔬 Master hypothesis testing for confident decisions!")