๐ฒ Random Number Generation: The Art of Controlled Chaos
Imagine you're a casino owner who needs perfectly fair dice, but also needs to replay exact games for security reviews. That's random number generation in NumPy! ๐ฐ It gives you randomness when you need it, reproducibility when you require it, and distributions for every occasion - from simulating quantum physics to generating synthetic datasets for machine learning.
The Random Universe: More Than Just Dice Rolls ๐
Random numbers are the lifeblood of data science. They power Monte Carlo simulations, bootstrap statistics, neural network initialization, data augmentation, A/B testing, and so much more. But here's the secret: computer-generated random numbers aren't truly random - they're pseudorandom, following deterministic algorithms that appear random. This is actually a feature, not a bug!
The Slot Machine Analogy ๐ฐ
Think of NumPy's random number generator as a sophisticated slot machine. The seed is like setting the starting position of all the reels. Pull the lever (call a random function) and you get a result. Reset to the same seed, pull again, and you get the exact same sequence! This is crucial for debugging, reproducible research, and testing.
import numpy as np
import matplotlib.pyplot as plt
# Random Number Generation Fundamentals ๐ฒ
def explore_random_basics():
"""
Understanding the foundations of random number generation.
Every data scientist's essential toolkit!
"""
print("๐ฒ Random Number Generation Basics")
print("=" * 60)
# The Random Number Generator (RNG)
rng = np.random.RandomState(42) # Create a generator with seed
print("Creating Random Numbers:")
print("-" * 40)
# Basic random generation
random_float = rng.random() # Single random float [0, 1)
random_array = rng.random(5) # Array of random floats
random_matrix = rng.random((3, 3)) # Random matrix
print(f"Single random float: {random_float:.6f}")
print(f"Random array: {random_array}")
print(f"Random matrix:\n{random_matrix}")
# Different ranges
print("\n๐ Random Numbers in Different Ranges:")
# Random integers
random_ints = rng.randint(0, 10, size=10) # 10 random ints [0, 10)
print(f"Random integers (0-9): {random_ints}")
# Random floats in custom range
random_range = rng.uniform(-5, 5, size=5) # Uniform in [-5, 5]
print(f"Random floats (-5 to 5): {random_range}")
# Random choice from array
options = ['red', 'green', 'blue', 'yellow']
random_colors = rng.choice(options, size=10, replace=True)
print(f"Random choices: {random_colors}")
# Shuffling
deck = np.arange(52) # Card deck
rng.shuffle(deck) # In-place shuffle
print(f"Shuffled deck (first 10): {deck[:10]}")
explore_random_basics()
# The Importance of Seeds ๐ฑ
def demonstrate_seed_importance():
"""
Seeds: The key to reproducibility in data science!
"""
print("\n๐ฑ The Power of Seeds")
print("=" * 60)
# Without setting seed - different every time
print("Without seed (run multiple times - different results):")
for i in range(3):
print(f" Run {i+1}: {np.random.random(3).round(3)}")
# With seed - same every time
print("\nWith seed=42 (always same results):")
for i in range(3):
np.random.seed(42)
print(f" Run {i+1}: {np.random.random(3).round(3)}")
# Practical example: Train/test split
print("\n๐ Reproducible Train/Test Split:")
data = np.arange(100)
# First split
np.random.seed(123)
np.random.shuffle(data)
train1, test1 = data[:80], data[80:]
# Reset and split again
data = np.arange(100)
np.random.seed(123) # Same seed
np.random.shuffle(data)
train2, test2 = data[:80], data[80:]
print(f"First split - test set (first 5): {test1[:5]}")
print(f"Second split - test set (first 5): {test2[:5]}")
print(f"Identical splits: {np.array_equal(test1, test2)}")
demonstrate_seed_importance()
Probability Distributions: Your Random Toolkit ๐
Different problems need different types of randomness. NumPy provides a rich collection of probability distributions, each with unique properties and use cases. Think of them as different dice - some fair, some weighted, some with infinite sides!
๐ฏ Uniform Distribution
Use Case: Equal probability for all outcomes
Real World: Dice rolls, random sampling, Monte Carlo
np.random.uniform(low=0, high=1, size=1000)
๐ Normal (Gaussian) Distribution
Use Case: Natural phenomena, measurement errors
Real World: Heights, test scores, stock returns
np.random.normal(loc=0, scale=1, size=1000)
โฑ๏ธ Exponential Distribution
Use Case: Time between events
Real World: Customer arrivals, component failures
np.random.exponential(scale=1.0, size=1000)
๐ฏ Binomial Distribution
Use Case: Number of successes in n trials
Real World: Coin flips, A/B testing, quality control
np.random.binomial(n=10, p=0.5, size=1000)
import numpy as np
import matplotlib.pyplot as plt
class DistributionExplorer:
"""
Master class for exploring probability distributions.
Your guide to the random universe! ๐
"""
def __init__(self, seed=42):
self.rng = np.random.RandomState(seed)
def uniform_distribution(self):
"""Uniform: Every outcome equally likely"""
print("๐ฒ Uniform Distribution")
print("=" * 60)
# Generate uniform random numbers
uniform_data = self.rng.uniform(0, 10, size=10000)
print("Properties:")
print(f" Range: [0, 10)")
print(f" Mean (theoretical): {5.0}")
print(f" Mean (actual): {uniform_data.mean():.3f}")
print(f" Std (theoretical): {10/np.sqrt(12):.3f}")
print(f" Std (actual): {uniform_data.std():.3f}")
# Real-world example: Random sampling
print("\n๐ Example: Random Sampling")
population = np.arange(1, 101) # Population of 100 items
sample = self.rng.choice(population, size=10, replace=False)
print(f"Random sample of 10: {sorted(sample)}")
# Dice simulation
print("\n๐ฒ Example: Dice Rolling")
dice_rolls = self.rng.randint(1, 7, size=1000)
for face in range(1, 7):
prob = (dice_rolls == face).mean()
print(f" Face {face}: {prob*100:.1f}% (expected: 16.7%)")
def normal_distribution(self):
"""Normal: The bell curve - nature's favorite"""
print("\n๐ Normal (Gaussian) Distribution")
print("=" * 60)
# Generate normal random numbers
mean, std = 100, 15 # IQ scores example
normal_data = self.rng.normal(mean, std, size=10000)
print("Properties:")
print(f" Mean (ฮผ): {mean}")
print(f" Std Dev (ฯ): {std}")
print(f" Actual mean: {normal_data.mean():.3f}")
print(f" Actual std: {normal_data.std():.3f}")
# 68-95-99.7 rule
within_1_std = np.sum(np.abs(normal_data - mean) <= std) / len(normal_data)
within_2_std = np.sum(np.abs(normal_data - mean) <= 2*std) / len(normal_data)
within_3_std = np.sum(np.abs(normal_data - mean) <= 3*std) / len(normal_data)
print("\n๐ 68-95-99.7 Rule:")
print(f" Within 1ฯ: {within_1_std*100:.1f}% (expected: 68.3%)")
print(f" Within 2ฯ: {within_2_std*100:.1f}% (expected: 95.4%)")
print(f" Within 3ฯ: {within_3_std*100:.1f}% (expected: 99.7%)")
# Real-world example: Height distribution
print("\n๐ฅ Example: Height Distribution")
male_heights = self.rng.normal(175, 7, size=1000) # cm
female_heights = self.rng.normal(162, 6, size=1000) # cm
print(f"Male heights: mean={male_heights.mean():.1f}cm, std={male_heights.std():.1f}cm")
print(f"Female heights: mean={female_heights.mean():.1f}cm, std={female_heights.std():.1f}cm")
print(f"Tallest 5% of males: >{np.percentile(male_heights, 95):.1f}cm")
# Central Limit Theorem demonstration
print("\n๐ฏ Central Limit Theorem Demo:")
sample_means = []
for _ in range(1000):
# Average of 30 uniform random numbers
sample = self.rng.uniform(0, 1, size=30)
sample_means.append(sample.mean())
sample_means = np.array(sample_means)
print(f"Uniform samples become normal when averaged!")
print(f"Mean of sample means: {sample_means.mean():.3f} (expected: 0.5)")
print(f"Std of sample means: {sample_means.std():.3f}")
def exponential_distribution(self):
"""Exponential: Time until next event"""
print("\nโฑ๏ธ Exponential Distribution")
print("=" * 60)
# Customer arrival times (average 1 every 5 minutes)
avg_time = 5.0 # minutes
arrival_intervals = self.rng.exponential(avg_time, size=1000)
print("Properties:")
print(f" Average interval (ฮป): {avg_time} minutes")
print(f" Actual mean: {arrival_intervals.mean():.3f} minutes")
print(f" Median: {np.median(arrival_intervals):.3f} minutes")
# Memoryless property
print("\n๐ Memoryless Property:")
print(" Probability of waiting > 5 min: "
f"{(arrival_intervals > 5).mean()*100:.1f}%")
print(" If already waited 5 min, prob of 5 more: "
f"{(arrival_intervals > 10).sum() / (arrival_intervals > 5).sum()*100:.1f}%")
# Real-world example: Server response times
print("\n๐ป Example: Server Response Times")
response_times = self.rng.exponential(0.1, size=10000) # 100ms average
print(f"Average response: {response_times.mean()*1000:.1f}ms")
print(f"95th percentile: {np.percentile(response_times, 95)*1000:.1f}ms")
print(f"99th percentile: {np.percentile(response_times, 99)*1000:.1f}ms")
def binomial_distribution(self):
"""Binomial: Counting successes"""
print("\n๐ฏ Binomial Distribution")
print("=" * 60)
# Coin flips
n_flips = 10
p_heads = 0.5
coin_results = self.rng.binomial(n_flips, p_heads, size=1000)
print("Coin Flip Experiment:")
print(f" Flips per trial: {n_flips}")
print(f" Probability of heads: {p_heads}")
print(f" Expected heads: {n_flips * p_heads}")
print(f" Actual mean heads: {coin_results.mean():.3f}")
# Distribution of results
for k in range(11):
prob = (coin_results == k).mean()
print(f" Exactly {k} heads: {prob*100:5.1f}%")
# Real-world example: A/B testing
print("\n๐งช Example: A/B Testing")
# Version A: 10% conversion rate
# Version B: 12% conversion rate
visitors_per_version = 1000
conversions_A = self.rng.binomial(1, 0.10, size=visitors_per_version)
conversions_B = self.rng.binomial(1, 0.12, size=visitors_per_version)
print(f"Version A: {conversions_A.sum()} conversions ({conversions_A.mean()*100:.1f}%)")
print(f"Version B: {conversions_B.sum()} conversions ({conversions_B.mean()*100:.1f}%)")
# Statistical significance (simplified)
difference = conversions_B.mean() - conversions_A.mean()
print(f"Observed lift: {difference*100:.1f}%")
def poisson_distribution(self):
"""Poisson: Counting events in fixed time/space"""
print("\n๐ Poisson Distribution")
print("=" * 60)
# Website visits per hour
avg_visits = 100
hourly_visits = self.rng.poisson(avg_visits, size=24) # 24 hours
print("Website Traffic Model:")
print(f" Average visits/hour: {avg_visits}")
print(f" Actual daily total: {hourly_visits.sum()}")
print(f" Peak hour: {hourly_visits.max()} visits")
print(f" Quiet hour: {hourly_visits.min()} visits")
# Relationship to exponential
print("\n๐ Poisson-Exponential Connection:")
print(" Poisson: Number of events in fixed time")
print(" Exponential: Time between events")
# If average is 100 visits/hour, time between visits:
time_between = 60 / avg_visits # minutes
print(f" Average time between visits: {time_between:.2f} minutes")
# Real-world example: Call center
print("\nโ๏ธ Example: Call Center Planning")
calls_per_minute = self.rng.poisson(3, size=60) # 60 minutes
print(f"Calls in 1 hour: {calls_per_minute.sum()}")
print(f"Minutes with 0 calls: {(calls_per_minute == 0).sum()}")
print(f"Minutes with >5 calls: {(calls_per_minute > 5).sum()}")
print(f"Staff needed (1 per 2 calls): {int(np.ceil(calls_per_minute.max() / 2))}")
def multivariate_normal(self):
"""Multivariate Normal: Correlated random variables"""
print("\n๐ Multivariate Normal Distribution")
print("=" * 60)
# Correlated variables: Height and Weight
mean = [175, 75] # Mean height (cm), weight (kg)
# Covariance matrix
# Variance of height: 49 (std=7)
# Variance of weight: 100 (std=10)
# Correlation: 0.7
correlation = 0.7
cov = [[49, correlation * 7 * 10],
[correlation * 7 * 10, 100]]
data = self.rng.multivariate_normal(mean, cov, size=1000)
heights = data[:, 0]
weights = data[:, 1]
print("Correlated Variables (Height & Weight):")
print(f" Height: mean={heights.mean():.1f}cm, std={heights.std():.1f}cm")
print(f" Weight: mean={weights.mean():.1f}kg, std={weights.std():.1f}kg")
# Calculate actual correlation
actual_corr = np.corrcoef(heights, weights)[0, 1]
print(f" Correlation (expected): {correlation}")
print(f" Correlation (actual): {actual_corr:.3f}")
# Conditional distribution
tall_people = weights[heights > 185]
short_people = weights[heights < 165]
print(f"\nConditional Distributions:")
print(f" Weight of tall people (>185cm): {tall_people.mean():.1f}kg")
print(f" Weight of short people (<165cm): {short_people.mean():.1f}kg")
print(f" Difference: {tall_people.mean() - short_people.mean():.1f}kg")
# Explore all distributions
explorer = DistributionExplorer()
explorer.uniform_distribution()
explorer.normal_distribution()
explorer.exponential_distribution()
explorer.binomial_distribution()
explorer.poisson_distribution()
explorer.multivariate_normal()
Monte Carlo Simulations: Solving Problems with Randomness ๐ฐ
Monte Carlo methods use random sampling to solve problems that might be deterministic in principle. It's like estimating the area of a lake by randomly throwing stones and counting how many land in the water!
import numpy as np
class MonteCarloSimulations:
"""
Monte Carlo methods: Solving complex problems with random numbers.
When in doubt, simulate it out! ๐ฒ
"""
def __init__(self, seed=42):
self.rng = np.random.RandomState(seed)
def estimate_pi(self, n_samples=1000000):
"""
Estimate ฯ by randomly throwing darts at a square!
Classic Monte Carlo example.
"""
print("๐ฏ Estimating ฯ with Monte Carlo")
print("=" * 60)
# Generate random points in [0,1] x [0,1]
x = self.rng.random(n_samples)
y = self.rng.random(n_samples)
# Check if points fall inside unit circle
inside_circle = (x**2 + y**2) <= 1
# ฯ/4 = area of quarter circle / area of square
pi_estimate = 4 * inside_circle.mean()
print(f"Number of samples: {n_samples:,}")
print(f"Points inside circle: {inside_circle.sum():,}")
print(f"ฯ estimate: {pi_estimate:.6f}")
print(f"Actual ฯ: {np.pi:.6f}")
print(f"Error: {abs(pi_estimate - np.pi):.6f}")
# Show convergence
print("\n๐ Convergence Analysis:")
sample_sizes = [100, 1000, 10000, 100000, 1000000]
for n in sample_sizes:
x_subset = x[:n]
y_subset = y[:n]
inside = (x_subset**2 + y_subset**2) <= 1
estimate = 4 * inside.mean()
error = abs(estimate - np.pi)
print(f" n={n:>7,}: ฯโ{estimate:.5f}, error={error:.5f}")
return pi_estimate
def stock_price_simulation(self, S0=100, mu=0.05, sigma=0.2,
T=1.0, n_steps=252, n_paths=10000):
"""
Simulate stock prices using Geometric Brownian Motion.
The foundation of option pricing!
"""
print("\n๐ Stock Price Monte Carlo Simulation")
print("=" * 60)
print(f"Initial price: ${S0}")
print(f"Annual return (ฮผ): {mu*100:.1f}%")
print(f"Annual volatility (ฯ): {sigma*100:.1f}%")
print(f"Time horizon: {T} year")
print(f"Paths simulated: {n_paths:,}")
# Time steps
dt = T / n_steps
# Generate random shocks
Z = self.rng.randn(n_paths, n_steps)
# Simulate paths using GBM formula
# dS = ฮผS dt + ฯS dW
returns = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z
# Calculate prices
price_paths = np.zeros((n_paths, n_steps + 1))
price_paths[:, 0] = S0
for t in range(1, n_steps + 1):
price_paths[:, t] = price_paths[:, t-1] * np.exp(returns[:, t-1])
# Statistics
final_prices = price_paths[:, -1]
print(f"\n๐ Results after {T} year:")
print(f" Mean price: ${final_prices.mean():.2f}")
print(f" Median price: ${np.median(final_prices):.2f}")
print(f" Std deviation: ${final_prices.std():.2f}")
# Percentiles (Value at Risk)
percentiles = [1, 5, 25, 50, 75, 95, 99]
print(f"\n๐ Price Distribution:")
for p in percentiles:
value = np.percentile(final_prices, p)
print(f" {p:>2}th percentile: ${value:.2f}")
# Probability of profit/loss
prob_profit = (final_prices > S0).mean()
prob_double = (final_prices > 2 * S0).mean()
prob_loss_50 = (final_prices < 0.5 * S0).mean()
print(f"\n๐ฒ Probabilities:")
print(f" Profit (>$100): {prob_profit*100:.1f}%")
print(f" Double (>$200): {prob_double*100:.1f}%")
print(f" Loss >50% (<$50): {prob_loss_50*100:.1f}%")
return price_paths
def option_pricing_black_scholes(self, S0=100, K=105, r=0.05,
sigma=0.2, T=0.25, n_simulations=100000):
"""
Price a European call option using Monte Carlo.
Compare with Black-Scholes formula.
"""
print("\n๐ฐ Option Pricing with Monte Carlo")
print("=" * 60)
print("Option Parameters:")
print(f" Current price (Sโ): ${S0}")
print(f" Strike price (K): ${K}")
print(f" Risk-free rate: {r*100:.1f}%")
print(f" Volatility: {sigma*100:.1f}%")
print(f" Time to expiry: {T} years")
# Simulate final prices
Z = self.rng.randn(n_simulations)
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
# Calculate payoffs
payoffs = np.maximum(ST - K, 0) # Call option payoff
# Discount to present value
option_price = np.exp(-r * T) * payoffs.mean()
print(f"\n๐ Monte Carlo Results:")
print(f" Simulations: {n_simulations:,}")
print(f" Option price: ${option_price:.3f}")
# Standard error
std_error = payoffs.std() / np.sqrt(n_simulations) * np.exp(-r * T)
print(f" Standard error: ${std_error:.4f}")
print(f" 95% CI: [${option_price - 1.96*std_error:.3f}, "
f"${option_price + 1.96*std_error:.3f}]")
# Black-Scholes formula for comparison
from scipy.stats import norm
d1 = (np.log(S0/K) + (r + 0.5*sigma**2)*T) / (sigma*np.sqrt(T))
d2 = d1 - sigma*np.sqrt(T)
bs_price = S0*norm.cdf(d1) - K*np.exp(-r*T)*norm.cdf(d2)
print(f"\nโ
Black-Scholes price: ${bs_price:.3f}")
print(f"Monte Carlo error: ${abs(option_price - bs_price):.4f}")
# Greeks estimation via Monte Carlo
print(f"\n๐ข Greeks Estimation:")
# Delta: price sensitivity to stock price
S_up = S0 * 1.01
Z_new = self.rng.randn(n_simulations)
ST_up = S_up * np.exp((r - 0.5*sigma**2)*T + sigma*np.sqrt(T)*Z_new)
payoffs_up = np.maximum(ST_up - K, 0)
price_up = np.exp(-r*T) * payoffs_up.mean()
delta = (price_up - option_price) / (S_up - S0)
print(f" Delta (โV/โS): {delta:.4f}")
return option_price
def portfolio_var_simulation(self, n_assets=5, n_days=252, n_simulations=10000):
"""
Calculate Value at Risk (VaR) for a portfolio.
Risk management through simulation.
"""
print("\nโ ๏ธ Portfolio Value at Risk (VaR)")
print("=" * 60)
# Portfolio setup
weights = self.rng.dirichlet(np.ones(n_assets)) # Random weights sum to 1
initial_value = 1000000 # $1M portfolio
print(f"Portfolio Setup:")
print(f" Initial value: ${initial_value:,}")
print(f" Number of assets: {n_assets}")
print(f" Weights: {weights.round(3)}")
# Historical returns parameters (annualized)
mean_returns = self.rng.uniform(0.05, 0.15, n_assets)
volatilities = self.rng.uniform(0.15, 0.30, n_assets)
# Correlation matrix (realistic correlations)
correlation = 0.3 + 0.5 * self.rng.random((n_assets, n_assets))
np.fill_diagonal(correlation, 1)
correlation = (correlation + correlation.T) / 2 # Symmetrize
# Covariance matrix
cov_matrix = np.outer(volatilities, volatilities) * correlation
# Simulate portfolio returns
daily_returns = self.rng.multivariate_normal(
mean_returns/252,
cov_matrix/252,
size=(n_simulations, n_days)
)
# Calculate portfolio returns
portfolio_returns = daily_returns @ weights
# Calculate portfolio values
portfolio_values = initial_value * np.cumprod(1 + portfolio_returns, axis=1)
final_values = portfolio_values[:, -1]
# VaR calculations
var_95 = initial_value - np.percentile(final_values, 5)
var_99 = initial_value - np.percentile(final_values, 1)
cvar_95 = initial_value - final_values[final_values <= np.percentile(final_values, 5)].mean()
print(f"\n๐ Risk Metrics (1 year horizon):")
print(f" VaR 95%: ${var_95:,.2f}")
print(f" VaR 99%: ${var_99:,.2f}")
print(f" CVaR 95% (Expected Shortfall): ${cvar_95:,.2f}")
# Probability of various losses
prob_loss = (final_values < initial_value).mean()
prob_loss_10 = (final_values < 0.9 * initial_value).mean()
prob_loss_25 = (final_values < 0.75 * initial_value).mean()
print(f"\n๐ฏ Loss Probabilities:")
print(f" Any loss: {prob_loss*100:.1f}%")
print(f" Loss > 10%: {prob_loss_10*100:.1f}%")
print(f" Loss > 25%: {prob_loss_25*100:.1f}%")
# Best and worst scenarios
print(f"\n๐ Scenario Analysis:")
print(f" Best case (99th percentile): ${np.percentile(final_values, 99):,.2f}")
print(f" Worst case (1st percentile): ${np.percentile(final_values, 1):,.2f}")
print(f" Expected value: ${final_values.mean():,.2f}")
return final_values
# Run Monte Carlo simulations
mc = MonteCarloSimulations()
mc.estimate_pi()
price_paths = mc.stock_price_simulation()
option_price = mc.option_pricing_black_scholes()
portfolio_values = mc.portfolio_var_simulation()
Random Sampling and Bootstrapping ๐ฏ
Random sampling is the foundation of statistical inference. Bootstrapping takes this further - it's like creating parallel universes of your data to understand uncertainty!
import numpy as np
class RandomSampling:
"""
Advanced sampling techniques for data science.
From simple random sampling to sophisticated bootstrap methods!
"""
def __init__(self, seed=42):
self.rng = np.random.RandomState(seed)
def sampling_methods(self):
"""Demonstrate various sampling techniques"""
print("๐ฏ Sampling Methods")
print("=" * 60)
# Create a population
population = np.arange(1, 101) # 100 items
# 1. Simple Random Sampling
print("1๏ธโฃ Simple Random Sampling:")
simple_sample = self.rng.choice(population, size=10, replace=False)
print(f" Sample: {sorted(simple_sample)}")
# 2. Systematic Sampling
print("\n2๏ธโฃ Systematic Sampling:")
k = len(population) // 10 # Every kth element
start = self.rng.randint(0, k)
systematic_sample = population[start::k]
print(f" Every {k}th element starting at {start}: {systematic_sample}")
# 3. Stratified Sampling
print("\n3๏ธโฃ Stratified Sampling:")
# Divide into strata
strata1 = population[:25] # Low
strata2 = population[25:75] # Medium
strata3 = population[75:] # High
# Sample from each stratum
sample1 = self.rng.choice(strata1, size=3, replace=False)
sample2 = self.rng.choice(strata2, size=5, replace=False)
sample3 = self.rng.choice(strata3, size=2, replace=False)
stratified_sample = np.concatenate([sample1, sample2, sample3])
print(f" Low stratum: {sorted(sample1)}")
print(f" Medium stratum: {sorted(sample2)}")
print(f" High stratum: {sorted(sample3)}")
# 4. Weighted Sampling
print("\n4๏ธโฃ Weighted Sampling:")
# Higher numbers have higher probability
weights = population / population.sum()
weighted_sample = self.rng.choice(population, size=10,
replace=True, p=weights)
print(f" Sample (biased toward higher values): {sorted(weighted_sample)}")
# 5. Reservoir Sampling (for streaming data)
print("\n5๏ธโฃ Reservoir Sampling (streaming):")
def reservoir_sampling(stream, k):
"""Sample k items from a stream of unknown length"""
reservoir = []
for i, item in enumerate(stream):
if i < k:
reservoir.append(item)
else:
# Replace with probability k/(i+1)
j = self.rng.randint(0, i + 1)
if j < k:
reservoir[j] = item
return reservoir
# Simulate a stream
stream = iter(range(1, 1001)) # 1000 items
reservoir = reservoir_sampling(stream, 10)
print(f" Sample from stream: {sorted(reservoir)}")
def bootstrap_confidence_intervals(self):
"""
Bootstrap: Estimate uncertainty by resampling.
The statistician's time machine!
"""
print("\n๐ Bootstrap Confidence Intervals")
print("=" * 60)
# Original sample (e.g., customer satisfaction scores)
original_data = self.rng.normal(75, 10, size=100)
print(f"Original sample:")
print(f" Size: {len(original_data)}")
print(f" Mean: {original_data.mean():.2f}")
print(f" Std: {original_data.std():.2f}")
# Bootstrap resampling
n_bootstrap = 10000
bootstrap_means = []
for _ in range(n_bootstrap):
# Resample with replacement
bootstrap_sample = self.rng.choice(original_data,
size=len(original_data),
replace=True)
bootstrap_means.append(bootstrap_sample.mean())
bootstrap_means = np.array(bootstrap_means)
# Calculate confidence intervals
confidence_levels = [90, 95, 99]
print(f"\n๐ Bootstrap Results ({n_bootstrap} resamples):")
print(f" Bootstrap mean: {bootstrap_means.mean():.2f}")
print(f" Bootstrap SE: {bootstrap_means.std():.3f}")
print(f"\n๐ฏ Confidence Intervals:")
for conf in confidence_levels:
alpha = (100 - conf) / 2
lower = np.percentile(bootstrap_means, alpha)
upper = np.percentile(bootstrap_means, 100 - alpha)
print(f" {conf}% CI: [{lower:.2f}, {upper:.2f}]")
# Compare with theoretical CI (assuming normal)
theoretical_se = original_data.std() / np.sqrt(len(original_data))
theoretical_ci_95 = [
original_data.mean() - 1.96 * theoretical_se,
original_data.mean() + 1.96 * theoretical_se
]
print(f"\n๐ Theoretical 95% CI: [{theoretical_ci_95[0]:.2f}, {theoretical_ci_95[1]:.2f}]")
return bootstrap_means
def permutation_testing(self):
"""
Permutation tests: Non-parametric hypothesis testing.
When assumptions don't hold, shuffle it!
"""
print("\n๐ Permutation Testing")
print("=" * 60)
# Two groups (e.g., treatment vs control)
group_A = self.rng.normal(100, 15, size=50) # Control
group_B = self.rng.normal(105, 15, size=50) # Treatment (5 point improvement)
observed_diff = group_B.mean() - group_A.mean()
print("Hypothesis Test: Is there a difference between groups?")
print(f" Group A mean: {group_A.mean():.2f}")
print(f" Group B mean: {group_B.mean():.2f}")
print(f" Observed difference: {observed_diff:.2f}")
# Permutation test
n_permutations = 10000
combined = np.concatenate([group_A, group_B])
permuted_diffs = []
for _ in range(n_permutations):
# Shuffle and split
self.rng.shuffle(combined)
perm_A = combined[:50]
perm_B = combined[50:]
# Calculate difference
perm_diff = perm_B.mean() - perm_A.mean()
permuted_diffs.append(perm_diff)
permuted_diffs = np.array(permuted_diffs)
# Calculate p-value
p_value = (np.abs(permuted_diffs) >= np.abs(observed_diff)).mean()
print(f"\n๐ Permutation Test Results:")
print(f" Permutations: {n_permutations}")
print(f" P-value: {p_value:.4f}")
print(f" Significant at ฮฑ=0.05? {'Yes โ
' if p_value < 0.05 else 'No โ'}")
# Effect size
pooled_std = np.sqrt((group_A.var() + group_B.var()) / 2)
cohens_d = observed_diff / pooled_std
print(f" Cohen's d (effect size): {cohens_d:.3f}")
return permuted_diffs
def cross_validation_splits(self):
"""
Random splits for cross-validation.
Essential for machine learning!
"""
print("\nโ๏ธ Cross-Validation Splits")
print("=" * 60)
# Dataset
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1) # Features
y = np.arange(n_samples) # Labels
# 1. Train/Test Split
print("1๏ธโฃ Train/Test Split (80/20):")
indices = np.arange(n_samples)
self.rng.shuffle(indices)
train_size = int(0.8 * n_samples)
train_idx = indices[:train_size]
test_idx = indices[train_size:]
print(f" Train indices (first 10): {sorted(train_idx[:10])}")
print(f" Test indices (first 10): {sorted(test_idx[:10])}")
# 2. K-Fold Cross-Validation
print("\n2๏ธโฃ K-Fold Cross-Validation (k=5):")
k = 5
fold_size = n_samples // k
for fold in range(k):
start = fold * fold_size
end = start + fold_size if fold < k-1 else n_samples
val_idx = indices[start:end]
train_idx = np.concatenate([indices[:start], indices[end:]])
print(f" Fold {fold+1}: Val size={len(val_idx)}, Train size={len(train_idx)}")
# 3. Stratified K-Fold (for imbalanced classes)
print("\n3๏ธโฃ Stratified K-Fold (preserving class distribution):")
# Create imbalanced labels
y_imbalanced = np.array([0]*70 + [1]*30) # 70% class 0, 30% class 1
# Stratified split
class_0_idx = np.where(y_imbalanced == 0)[0]
class_1_idx = np.where(y_imbalanced == 1)[0]
self.rng.shuffle(class_0_idx)
self.rng.shuffle(class_1_idx)
fold_size_0 = len(class_0_idx) // k
fold_size_1 = len(class_1_idx) // k
for fold in range(k):
val_0 = class_0_idx[fold*fold_size_0:(fold+1)*fold_size_0]
val_1 = class_1_idx[fold*fold_size_1:(fold+1)*fold_size_1]
val_idx = np.concatenate([val_0, val_1])
print(f" Fold {fold+1}: Class 0={len(val_0)}, Class 1={len(val_1)}")
# Explore sampling techniques
sampler = RandomSampling()
sampler.sampling_methods()
bootstrap_means = sampler.bootstrap_confidence_intervals()
permuted_diffs = sampler.permutation_testing()
sampler.cross_validation_splits()
Best Practices and Common Pitfalls ๐ฏ
๐ฑ Always Set Seeds for Reproducibility
# Good: Reproducible
np.random.seed(42)
results = np.random.random(100)
# Better: Use RandomState
rng = np.random.RandomState(42)
results = rng.random(100)
# Best: Use new Generator (NumPy 1.17+)
rng = np.random.default_rng(42)
results = rng.random(100)
โ ๏ธ Common Pitfalls
- Not setting seeds: Results change every run
- Using global random state: Functions affect each other
- Wrong distribution: Using uniform when you need normal
- Ignoring numerical stability: Log of zero, division issues
- Sample size too small: Monte Carlo needs many samples
โ Best Practices
- Use appropriate distributions: Match the problem domain
- Check convergence: Increase samples until stable
- Validate with known solutions: Test on problems with analytical answers
- Document random choices: Explain why you chose specific distributions
- Use vectorization: Generate all random numbers at once
Summary: Your Random Number Toolkit ๐ฒ
# Quick Random Reference Card
import numpy as np
# SETUP
# -----
rng = np.random.RandomState(42) # Reproducible generator
# BASIC GENERATION
# ----------------
rng.random(size) # Uniform [0, 1)
rng.randint(low, high, size) # Integers [low, high)
rng.choice(array, size) # Random selection
rng.shuffle(array) # In-place shuffle
rng.permutation(array) # Return shuffled copy
# DISTRIBUTIONS
# -------------
rng.uniform(low, high, size) # Uniform
rng.normal(mu, sigma, size) # Normal/Gaussian
rng.exponential(scale, size) # Exponential
rng.poisson(lam, size) # Poisson
rng.binomial(n, p, size) # Binomial
rng.multivariate_normal(mean, cov, size) # Multivariate
# SAMPLING
# --------
# Simple random sample
sample = rng.choice(population, size=n, replace=False)
# Bootstrap sample
boot = rng.choice(data, size=len(data), replace=True)
# Stratified sample
strata_samples = [rng.choice(s, size=n) for s in strata]
# MONTE CARLO PATTERN
# -------------------
n_simulations = 10000
results = []
for _ in range(n_simulations):
simulation = rng.random() # Your random process
results.append(simulation)
estimate = np.mean(results)
std_error = np.std(results) / np.sqrt(n_simulations)
print("๐ฒ Master randomness to master data science!")
๐ฏ Key Takeaways: