Skip to main content

๐Ÿงน Data Cleaning and Preprocessing

Transform messy, real-world data into pristine datasets ready for analysis

๐ŸŒŸ The Art of Data Cleaning

Imagine you're a detective examining evidence. Some clues are missing, others are misleading, and many are covered in dirt. Data cleaning is like forensic analysis - you must carefully examine, clean, and organize your evidence before solving the case!

In the real world, data scientists spend up to 80% of their time cleaning data. Raw data is like a rough diamond - valuable but needing careful cutting and polishing to reveal its true brilliance. Master these techniques, and you'll transform chaotic datasets into analytical gold!

๐Ÿ”„ The Data Cleaning Pipeline

graph TB A[Raw Data] --> B{Assess Quality} B --> C[Missing Values] B --> D[Duplicates] B --> E[Data Types] B --> F[Outliers] B --> G[Inconsistencies] C --> H[Handle Missing] D --> I[Remove Duplicates] E --> J[Convert Types] F --> K[Process Outliers] G --> L[Standardize] H --> M[Clean Data] I --> M J --> M K --> M L --> M M --> N[Validate] N --> O[Ready for Analysis] style A fill:#faa,stroke:#333,stroke-width:2px style O fill:#afa,stroke:#333,stroke-width:2px style M fill:#aaf,stroke:#333,stroke-width:2px

๐ŸŽฎ Data Quality Analyzer

Watch as we clean a messy dataset in real-time:

30% Clean

๐Ÿ” Detecting Missing Values

Like finding holes in Swiss cheese - you need to know where they are before you can fix them!

# Check for missing values
df.isnull().sum()  # Count per column
df.isnull().sum().sum()  # Total missing

# Visualize missing data pattern
import missingno as msno
msno.matrix(df)  # Visual matrix
msno.bar(df)     # Bar chart

# Find percentage of missing
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
Age     Name    Salary
25      John    NaN
NaN     Alice   50000
30      NaN     60000
Age     Name    Salary
25      John    55000
28      Alice   50000
30      Bob     60000

๐ŸŽฏ Handling Missing Data

Choose your strategy wisely - like a surgeon deciding how to treat a wound!

# Strategy 1: Drop missing values
df.dropna()  # Drop rows with any NaN
df.dropna(how='all')  # Drop rows where ALL values are NaN
df.dropna(subset=['important_col'])  # Drop if specific column is NaN

# Strategy 2: Fill missing values
df.fillna(0)  # Fill with constant
df.fillna(df.mean())  # Fill with mean
df.fillna(method='ffill')  # Forward fill
df.fillna(method='bfill')  # Backward fill

# Strategy 3: Interpolation
df.interpolate(method='linear')  # Linear interpolation
df.interpolate(method='time')    # Time-based

# Strategy 4: Advanced imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

๐Ÿ”„ Removing Duplicates

Like finding twins in a crowd - sometimes they're exact copies, sometimes just very similar!

# Find duplicates
df.duplicated()  # Boolean mask
df.duplicated().sum()  # Count duplicates

# Show duplicate rows
df[df.duplicated(keep=False)]  # Show all duplicates

# Remove duplicates
df.drop_duplicates()  # Remove exact duplicates
df.drop_duplicates(subset=['id'])  # Based on specific columns
df.drop_duplicates(keep='first')  # Keep first occurrence
df.drop_duplicates(keep='last')   # Keep last occurrence

# Fuzzy matching for near-duplicates
from fuzzywuzzy import fuzz
def find_similar(name, names_list):
    return [(n, fuzz.ratio(name, n)) for n in names_list 
            if fuzz.ratio(name, n) > 80]

๐Ÿ”ง Data Type Conversion

Like translating between languages - ensuring everyone speaks the same data dialect!

# Check current data types
df.dtypes
df.info()

# Convert data types
df['price'] = df['price'].astype(float)
df['quantity'] = df['quantity'].astype(int)
df['category'] = df['category'].astype('category')

# Parse dates
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Handle errors

# Convert numeric with error handling
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')

# Memory optimization
df['category'] = df['category'].astype('category')  # For repeated values
df['small_int'] = df['small_int'].astype('int8')   # Use smaller int types

๐Ÿ“Š Outlier Detection

Finding the rebels in your data - values that march to their own beat!

# Statistical methods
# Z-score method
from scipy import stats
z_scores = stats.zscore(df['value'])
outliers = df[abs(z_scores) > 3]

# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]

# Visual detection
import matplotlib.pyplot as plt
df.boxplot(column='value')
plt.show()

# Handle outliers
# Option 1: Remove
df_clean = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]

# Option 2: Cap values
df['value'] = df['value'].clip(lower=lower_bound, upper=upper_bound)

# Option 3: Transform
df['value_log'] = np.log1p(df['value'])  # Log transformation

โœจ String Cleaning

Polishing text data until it sparkles - consistent, clean, and analysis-ready!

# Basic string cleaning
df['name'] = df['name'].str.strip()  # Remove whitespace
df['name'] = df['name'].str.lower()  # Lowercase
df['name'] = df['name'].str.title()  # Title case

# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Fix inconsistencies
df['category'] = df['category'].replace({
    'cat1': 'Category 1',
    'CAT1': 'Category 1',
    'category_1': 'Category 1'
})

# Extract information
df['domain'] = df['email'].str.extract(r'@([^.]+)')
df['year'] = df['date_string'].str.extract(r'(\d{4})')

# Split strings
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)

# Normalize text
import unicodedata
def normalize_text(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')

๐ŸŒ Real-World Scenario: Customer Data Cleaning

Let's clean a messy customer dataset from an e-commerce platform:

# Load messy customer data
import pandas as pd
import numpy as np
from datetime import datetime

# Simulate real-world messy data
customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C002', 'C003', 'C004', np.nan],
    'name': ['John Smith', 'alice jones', 'Alice Jones', 'BOB WILSON', '  Sarah Lee  ', 'Mike Brown'],
    'email': ['john@email.com', 'ALICE@GMAIL.COM', 'alice@gmail.com', 'bob@', 'sarah@@email.com', 'mike@email'],
    'age': [25, 130, 35, -5, 42, 'thirty'],
    'signup_date': ['2024-01-15', '15/01/2024', '2024/01/15', '2024-13-45', '2024-02-30', '2024-03-10'],
    'purchase_amount': ['$1,234.56', '2345.67', '$3,456.78', 'N/A', '4567.89', '5678.90']
})

print("๐Ÿ”ด BEFORE CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")

# Step 1: Handle duplicates
customers = customers.drop_duplicates(subset=['customer_id'], keep='first')

# Step 2: Fix customer IDs
customers['customer_id'] = customers['customer_id'].fillna(
    'C' + customers.index.astype(str).str.zfill(3)
)

# Step 3: Clean names
customers['name'] = customers['name'].str.strip().str.title()

# Step 4: Validate and clean emails
import re
def validate_email(email):
    if pd.isna(email):
        return np.nan
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    email = str(email).lower()
    if re.match(pattern, email):
        return email
    return np.nan

customers['email'] = customers['email'].apply(validate_email)

# Step 5: Clean age column
customers['age'] = pd.to_numeric(customers['age'], errors='coerce')
customers.loc[customers['age'] < 0, 'age'] = np.nan
customers.loc[customers['age'] > 120, 'age'] = np.nan
customers['age'] = customers['age'].fillna(customers['age'].median())

# Step 6: Parse dates
def parse_date(date_str):
    if pd.isna(date_str):
        return np.nan
    for fmt in ['%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d']:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue
    return np.nan

customers['signup_date'] = customers['signup_date'].apply(parse_date)

# Step 7: Clean purchase amounts
customers['purchase_amount'] = (customers['purchase_amount']
    .str.replace('$', '')
    .str.replace(',', '')
    .replace('N/A', np.nan)
    .astype(float))

# Step 8: Fill missing purchase amounts with median
customers['purchase_amount'] = customers['purchase_amount'].fillna(
    customers['purchase_amount'].median()
)

print("\nโœ… AFTER CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")
print(f"\nData Quality Report:")
print(f"- Missing values: {customers.isnull().sum().sum()}")
print(f"- Duplicate IDs: {customers['customer_id'].duplicated().sum()}")
print(f"- Invalid emails: {customers['email'].isnull().sum()}")
print(f"- Data is now clean and ready for analysis!")

๐Ÿ’ก Pro Tips for Data Cleaning

โš ๏ธ Common Pitfalls to Avoid

๐Ÿ“‹ Data Cleaning Checklist

graph TD A[Start] --> B{Missing Values?} B -->|Yes| C[Handle Missing] B -->|No| D{Duplicates?} C --> D D -->|Yes| E[Remove Duplicates] D -->|No| F{Wrong Types?} E --> F F -->|Yes| G[Convert Types] F -->|No| H{Outliers?} G --> H H -->|Yes| I[Process Outliers] H -->|No| J{Inconsistent?} I --> J J -->|Yes| K[Standardize] J -->|No| L[Validate] K --> L L --> M{Quality OK?} M -->|No| B M -->|Yes| N[Complete] style A fill:#aaf,stroke:#333,stroke-width:2px style N fill:#afa,stroke:#333,stroke-width:2px style M fill:#ffa,stroke:#333,stroke-width:2px