Transform messy, real-world data into pristine datasets ready for analysis
Imagine you're a detective examining evidence. Some clues are missing, others are misleading, and many are covered in dirt. Data cleaning is like forensic analysis - you must carefully examine, clean, and organize your evidence before solving the case!
In the real world, data scientists spend up to 80% of their time cleaning data. Raw data is like a rough diamond - valuable but needing careful cutting and polishing to reveal its true brilliance. Master these techniques, and you'll transform chaotic datasets into analytical gold!
Watch as we clean a messy dataset in real-time:
Like finding holes in Swiss cheese - you need to know where they are before you can fix them!
# Check for missing values
df.isnull().sum() # Count per column
df.isnull().sum().sum() # Total missing
# Visualize missing data pattern
import missingno as msno
msno.matrix(df) # Visual matrix
msno.bar(df) # Bar chart
# Find percentage of missing
missing_pct = (df.isnull().sum() / len(df)) * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))
Age Name Salary
25 John NaN
NaN Alice 50000
30 NaN 60000
Age Name Salary
25 John 55000
28 Alice 50000
30 Bob 60000
Choose your strategy wisely - like a surgeon deciding how to treat a wound!
# Strategy 1: Drop missing values
df.dropna() # Drop rows with any NaN
df.dropna(how='all') # Drop rows where ALL values are NaN
df.dropna(subset=['important_col']) # Drop if specific column is NaN
# Strategy 2: Fill missing values
df.fillna(0) # Fill with constant
df.fillna(df.mean()) # Fill with mean
df.fillna(method='ffill') # Forward fill
df.fillna(method='bfill') # Backward fill
# Strategy 3: Interpolation
df.interpolate(method='linear') # Linear interpolation
df.interpolate(method='time') # Time-based
# Strategy 4: Advanced imputation
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
Like finding twins in a crowd - sometimes they're exact copies, sometimes just very similar!
# Find duplicates
df.duplicated() # Boolean mask
df.duplicated().sum() # Count duplicates
# Show duplicate rows
df[df.duplicated(keep=False)] # Show all duplicates
# Remove duplicates
df.drop_duplicates() # Remove exact duplicates
df.drop_duplicates(subset=['id']) # Based on specific columns
df.drop_duplicates(keep='first') # Keep first occurrence
df.drop_duplicates(keep='last') # Keep last occurrence
# Fuzzy matching for near-duplicates
from fuzzywuzzy import fuzz
def find_similar(name, names_list):
return [(n, fuzz.ratio(name, n)) for n in names_list
if fuzz.ratio(name, n) > 80]
Like translating between languages - ensuring everyone speaks the same data dialect!
# Check current data types
df.dtypes
df.info()
# Convert data types
df['price'] = df['price'].astype(float)
df['quantity'] = df['quantity'].astype(int)
df['category'] = df['category'].astype('category')
# Parse dates
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce') # Handle errors
# Convert numeric with error handling
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
# Memory optimization
df['category'] = df['category'].astype('category') # For repeated values
df['small_int'] = df['small_int'].astype('int8') # Use smaller int types
Finding the rebels in your data - values that march to their own beat!
# Statistical methods
# Z-score method
from scipy import stats
z_scores = stats.zscore(df['value'])
outliers = df[abs(z_scores) > 3]
# IQR method
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
# Visual detection
import matplotlib.pyplot as plt
df.boxplot(column='value')
plt.show()
# Handle outliers
# Option 1: Remove
df_clean = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]
# Option 2: Cap values
df['value'] = df['value'].clip(lower=lower_bound, upper=upper_bound)
# Option 3: Transform
df['value_log'] = np.log1p(df['value']) # Log transformation
Polishing text data until it sparkles - consistent, clean, and analysis-ready!
# Basic string cleaning
df['name'] = df['name'].str.strip() # Remove whitespace
df['name'] = df['name'].str.lower() # Lowercase
df['name'] = df['name'].str.title() # Title case
# Remove special characters
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)
# Fix inconsistencies
df['category'] = df['category'].replace({
'cat1': 'Category 1',
'CAT1': 'Category 1',
'category_1': 'Category 1'
})
# Extract information
df['domain'] = df['email'].str.extract(r'@([^.]+)')
df['year'] = df['date_string'].str.extract(r'(\d{4})')
# Split strings
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)
# Normalize text
import unicodedata
def normalize_text(text):
return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
Let's clean a messy customer dataset from an e-commerce platform:
# Load messy customer data
import pandas as pd
import numpy as np
from datetime import datetime
# Simulate real-world messy data
customers = pd.DataFrame({
'customer_id': ['C001', 'C002', 'C002', 'C003', 'C004', np.nan],
'name': ['John Smith', 'alice jones', 'Alice Jones', 'BOB WILSON', ' Sarah Lee ', 'Mike Brown'],
'email': ['john@email.com', 'ALICE@GMAIL.COM', 'alice@gmail.com', 'bob@', 'sarah@@email.com', 'mike@email'],
'age': [25, 130, 35, -5, 42, 'thirty'],
'signup_date': ['2024-01-15', '15/01/2024', '2024/01/15', '2024-13-45', '2024-02-30', '2024-03-10'],
'purchase_amount': ['$1,234.56', '2345.67', '$3,456.78', 'N/A', '4567.89', '5678.90']
})
print("๐ด BEFORE CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")
# Step 1: Handle duplicates
customers = customers.drop_duplicates(subset=['customer_id'], keep='first')
# Step 2: Fix customer IDs
customers['customer_id'] = customers['customer_id'].fillna(
'C' + customers.index.astype(str).str.zfill(3)
)
# Step 3: Clean names
customers['name'] = customers['name'].str.strip().str.title()
# Step 4: Validate and clean emails
import re
def validate_email(email):
if pd.isna(email):
return np.nan
pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
email = str(email).lower()
if re.match(pattern, email):
return email
return np.nan
customers['email'] = customers['email'].apply(validate_email)
# Step 5: Clean age column
customers['age'] = pd.to_numeric(customers['age'], errors='coerce')
customers.loc[customers['age'] < 0, 'age'] = np.nan
customers.loc[customers['age'] > 120, 'age'] = np.nan
customers['age'] = customers['age'].fillna(customers['age'].median())
# Step 6: Parse dates
def parse_date(date_str):
if pd.isna(date_str):
return np.nan
for fmt in ['%Y-%m-%d', '%d/%m/%Y', '%Y/%m/%d']:
try:
return pd.to_datetime(date_str, format=fmt)
except:
continue
return np.nan
customers['signup_date'] = customers['signup_date'].apply(parse_date)
# Step 7: Clean purchase amounts
customers['purchase_amount'] = (customers['purchase_amount']
.str.replace('$', '')
.str.replace(',', '')
.replace('N/A', np.nan)
.astype(float))
# Step 8: Fill missing purchase amounts with median
customers['purchase_amount'] = customers['purchase_amount'].fillna(
customers['purchase_amount'].median()
)
print("\nโ
AFTER CLEANING:")
print(customers)
print(f"\nData types:\n{customers.dtypes}")
print(f"\nData Quality Report:")
print(f"- Missing values: {customers.isnull().sum().sum()}")
print(f"- Duplicate IDs: {customers['customer_id'].duplicated().sum()}")
print(f"- Invalid emails: {customers['email'].isnull().sum()}")
print(f"- Data is now clean and ready for analysis!")
assert df['age'].min() >= 0pandas-profiling to get an overview before cleaning