🧹 Data Cleaning and Preprocessing

Transform messy, real-world data into pristine datasets ready for analysis

🌟 The Art of Data Cleaning

Imagine you're a detective examining evidence. Some clues are missing, others are misleading, and many are covered in dirt. Data cleaning is like forensic analysis - you must carefully examine, clean, and organize your evidence before solving the case!

In the real world, data scientists spend up to 80% of their time cleaning data. Raw data is like a rough diamond - valuable but needing careful cutting and polishing to reveal its true brilliance. Master these techniques, and you'll transform chaotic datasets into analytical gold!

🔄 The Data Cleaning Pipeline

graph TB A[Raw Data] --> B{Assess Quality} B --> C[Missing Values] B --> D[Duplicates] B --> E[Data Types] B --> F[Outliers] B --> G[Inconsistencies] C --> H[Handle Missing] D --> I[Remove Duplicates] E --> J[Convert Types] F --> K[Process Outliers] G --> L[Standardize] H --> M[Clean Data] I --> M J --> M K --> M L --> M M --> N[Validate] N --> O[Ready for Analysis] style A fill:#faa,stroke:#333,stroke-width:2px style O fill:#afa,stroke:#333,stroke-width:2px style M fill:#aaf,stroke:#333,stroke-width:2px

🎮 Data Quality Analyzer

Watch as we clean a messy dataset in real-time:

30% Clean

🔍 Detecting Missing Values

Like finding holes in Swiss cheese - you need to know where they are before you can fix them!

# Check for missing values
df.isnull().sum()  # Count per column
df.isnull().sum().sum()  # Total missing

# Visualize missing data pattern
import missingno as msno
msno.matrix(df)  # Visual matrix
msno.bar(df)     # Bar chart

# Find percentage of missing
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

Age     Name    Salary
25      John    NaN
NaN     Alice   50000
30      NaN     60000

Age     Name    Salary
25      John    55000
28      Alice   50000
30      Bob     60000

🎯 Handling Missing Data

Choose your strategy wisely - like a surgeon deciding how to treat a wound!

# Strategy 1: Drop missing values
df.dropna()  # Drop rows with any NaN
df.dropna(how='all')  # Drop rows where ALL values are NaN
df.dropna(subset=['important_col'])  # Drop if specific column is NaN

# Strategy 2: Fill missing values
df.fillna(0)  # Fill with constant
df.fillna(df.mean())  # Fill with mean
df.fillna(method='ffill')  # Forward fill
df.fillna(method='bfill')  # Backward fill

# Strategy 3: Interpolation
df.interpolate(method='linear')  # Linear interpolation
df.interpolate(method='time')    # Time-based

# Strategy 4: Advanced imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

🔄 Removing Duplicates

Like finding twins in a crowd - sometimes they're exact copies, sometimes just very similar!

# Find duplicates
df.duplicated()  # Boolean mask
df.duplicated().sum()  # Count duplicates

# Show duplicate rows
df[df.duplicated(keep=False)]  # Show all duplicates

# Remove duplicates
df.drop_duplicates()  # Remove exact duplicates
df.drop_duplicates(subset=['id'])  # Based on specific columns
df.drop_duplicates(keep='first')  # Keep first occurrence
df.drop_duplicates(keep='last')   # Keep last occurrence

# Fuzzy matching for near-duplicates
from fuzzywuzzy import fuzz
def find_similar(name, names_list):
    return [(n, fuzz.ratio(name, n)) for n in names_list 
            if fuzz.ratio(name, n) > 80]

🔧 Data Type Conversion

Like translating between languages - ensuring everyone speaks the same data dialect!

# Check current data types
df.dtypes
df.info()

# Convert data types
df['price'] = df['price'].astype(float)
df['quantity'] = df['quantity'].astype(int)
df['category'] = df['category'].astype('category')

# Parse dates
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Handle errors

# Convert numeric with error handling
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Memory optimization
df['category'] = df['category'].astype('category')  # For repeated values
df['small_int'] = df['small_int'].astype('int8')   # Use smaller int types

📊 Outlier Detection

Finding the rebels in your data - values that march to their own beat!

# Statistical methods
# Z-score method
from scipy import stats
z_scores = stats.zscore(df['value'])
outliers = df[abs(z_scores) > 3]

# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]

# Visual detection
import matplotlib.pyplot as plt
df.boxplot(column='value')
plt.show()

# Handle outliers
# Option 1: Remove
df_clean = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]

# Option 2: Cap values
df['value'] = df['value'].clip(lower=lower_bound, upper=upper_bound)

# Option 3: Transform
df['value_log'] = np.log1p(df['value'])  # Log transformation

✨ String Cleaning

Polishing text data until it sparkles - consistent, clean, and analysis-ready!

# Basic string cleaning
df['name'] = df['name'].str.strip()  # Remove whitespace
df['name'] = df['name'].str.lower()  # Lowercase
df['name'] = df['name'].str.title()  # Title case

# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Fix inconsistencies
df['category'] = df['category'].replace({
    'cat1': 'Category 1',
    'CAT1': 'Category 1',
    'category_1': 'Category 1'
})

# Extract information
df['domain'] = df['email'].str.extract(r'@([^.]+)')
df['year'] = df['date_string'].str.extract(r'(\d{4})')

# Split strings
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Normalize text
import unicodedata
def normalize_text(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

🌍 Real-World Scenario: Customer Data Cleaning

Let's clean a messy customer dataset from an e-commerce platform:

# Load messy customer data
import pandas as pd
import numpy as np
from datetime import datetime

# Simulate real-world messy data
customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C002', 'C003', 'C004', np.nan],
    'name': ['John Smith', 'alice jones', 'Alice Jones', 'BOB WILSON', '  Sarah Lee  ', 'Mike Brown'],
    'email': ['john@email.com', 'ALICE@GMAIL.COM', 'alice@gmail.com', 'bob@', 'sarah@@email.com', 'mike@email'],
    'age': [25, 130, 35, -5, 42, 'thirty'],
    'signup_date': ['2024-01-15', '15/01/2024', '2024/01/15', '2024-13-45', '2024-02-30', '2024-03-10'],
    'purchase_amount': ['$1,234.56', '2345.67', '$3,456.78', 'N/A', '4567.89', '5678.90']
})

print("🔴 BEFORE CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")

# Step 1: Handle duplicates
customers = customers.drop_duplicates(subset=['customer_id'], keep='first')

# Step 2: Fix customer IDs
customers['customer_id'] = customers['customer_id'].fillna(
    'C' + customers.index.astype(str).str.zfill(3)
)

# Step 3: Clean names
customers['name'] = customers['name'].str.strip().str.title()

# Step 4: Validate and clean emails
import re
def validate_email(email):
    if pd.isna(email):
        return np.nan
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    email = str(email).lower()
    if re.match(pattern, email):
        return email
    return np.nan

customers['email'] = customers['email'].apply(validate_email)

# Step 5: Clean age column
customers['age'] = pd.to_numeric(customers['age'], errors='coerce')
customers.loc[customers['age'] < 0, 'age'] = np.nan
customers.loc[customers['age'] > 120, 'age'] = np.nan
customers['age'] = customers['age'].fillna(customers['age'].median())

# Step 6: Parse dates
def parse_date(date_str):
    if pd.isna(date_str):
        return np.nan
    for fmt in ['%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue
    return np.nan

customers['signup_date'] = customers['signup_date'].apply(parse_date)

# Step 7: Clean purchase amounts
customers['purchase_amount'] = (customers['purchase_amount']
    .str.replace('$', '')
    .str.replace(',', '')
    .replace('N/A', np.nan)
    .astype(float))

# Step 8: Fill missing purchase amounts with median
customers['purchase_amount'] = customers['purchase_amount'].fillna(
    customers['purchase_amount'].median()
)

print("\n✅ AFTER CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")
print(f"\nData Quality Report:")
print(f"- Missing values: {customers.isnull().sum().sum()}")
print(f"- Duplicate IDs: {customers['customer_id'].duplicated().sum()}")
print(f"- Invalid emails: {customers['email'].isnull().sum()}")
print(f"- Data is now clean and ready for analysis!")

💡 Pro Tips for Data Cleaning

Document everything: Keep a log of all cleaning steps for reproducibility
Never modify original data: Always work on a copy
Validate after each step: Check that your cleaning didn't introduce new problems
Use assertions: Add checks like assert df['age'].min() >= 0
Create a cleaning pipeline: Build reusable functions for common cleaning tasks
Profile first: Use pandas-profiling to get an overview before cleaning

⚠️ Common Pitfalls to Avoid

Over-cleaning: Don't remove outliers that might be legitimate extreme values
Information loss: Be careful when dropping rows - you might lose important data
Biased imputation: Simple mean/median filling can introduce bias
Cascading errors: One bad cleaning step can corrupt the entire dataset
Ignoring domain knowledge: Always consult subject matter experts about data anomalies

📋 Data Cleaning Checklist

graph TD A[Start] --> B{Missing Values?} B -->|Yes| C[Handle Missing] B -->|No| D{Duplicates?} C --> D D -->|Yes| E[Remove Duplicates] D -->|No| F{Wrong Types?} E --> F F -->|Yes| G[Convert Types] F -->|No| H{Outliers?} G --> H H -->|Yes| I[Process Outliers] H -->|No| J{Inconsistent?} I --> J J -->|Yes| K[Standardize] J -->|No| L[Validate] K --> L L --> M{Quality OK?} M -->|No| B M -->|Yes| N[Complete] style A fill:#aaf,stroke:#333,stroke-width:2px style N fill:#afa,stroke:#333,stroke-width:2px style M fill:#ffa,stroke:#333,stroke-width:2px