📂 Data Loading with Pandas

Master the art of importing data from various sources into pandas DataFrames

🌟 Your Gateway to Data

Think of pandas as a universal translator for data formats. Just like a skilled interpreter who can understand multiple languages and convert between them seamlessly, pandas can read data from virtually any source - CSV files, Excel spreadsheets, JSON documents, SQL databases, and more!

In the real world, data rarely comes in a single, perfect format. You might receive sales data in Excel, customer feedback in CSV, API responses in JSON, and transaction records in a SQL database. Pandas is your Swiss Army knife for bringing all this data together.

🗂️ Data Loading Pipeline

graph LR A[Data Sources] --> B{Pandas Read Functions} B --> C[DataFrame] D[CSV Files] --> E[pd.read_csv] F[Excel Files] --> G[pd.read_excel] H[JSON Files] --> I[pd.read_json] J[SQL Database] --> K[pd.read_sql] L[HTML Tables] --> M[pd.read_html] N[Clipboard] --> O[pd.read_clipboard] E --> C G --> C I --> C K --> C M --> C O --> C style A fill:#f9f,stroke:#333,stroke-width:2px style C fill:#9f9,stroke:#333,stroke-width:2px style B fill:#99f,stroke:#333,stroke-width:2px

📄 CSV Files

The bread and butter of data exchange - simple, universal, and human-readable

Real-world example: Sales reports, sensor data logs, export from spreadsheets

# Basic CSV loading
df = pd.read_csv('sales_data.csv')

# With custom settings
df = pd.read_csv('data.csv',
    sep=';',           # European CSV with semicolon
    encoding='utf-8',  # Handle special characters
    parse_dates=['date_column'],  # Auto-parse dates
    index_col='id'     # Use 'id' column as index
)

📊 Excel Files

Corporate data's favorite format - multiple sheets, formatting, and formulas

Real-world example: Financial reports, inventory sheets, budget planning

# Read specific sheet
df = pd.read_excel('report.xlsx', sheet_name='Q1_Sales')

# Read multiple sheets
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
# Returns dictionary: {'Sheet1': df1, 'Sheet2': df2, ...}

# Skip rows and use specific columns
df = pd.read_excel('data.xlsx',
    skiprows=2,         # Skip header rows
    usecols='A:E',      # Only columns A to E
    na_values=['N/A', 'Missing']  # Custom NA values
)

🔗 JSON Files

The language of web APIs - nested, hierarchical, and flexible

Real-world example: API responses, NoSQL exports, configuration files

# Simple JSON
df = pd.read_json('api_response.json')

# Nested JSON with normalization
import json
with open('nested_data.json') as f:
    data = json.load(f)
df = pd.json_normalize(data, 
    record_path=['results'],  # Path to records
    meta=['timestamp', 'version']  # Include metadata
)

# From API endpoint
df = pd.read_json('https://api.example.com/data')

🗄️ SQL Databases

Enterprise data warehouses - structured, relational, and massive

Real-world example: Customer databases, transaction systems, data warehouses

# Using SQLAlchemy
from sqlalchemy import create_engine

# Connect to database
engine = create_engine('sqlite:///database.db')
# Or: 'postgresql://user:pass@host/db'
# Or: 'mysql://user:pass@host/db'

# Read entire table
df = pd.read_sql_table('customers', engine)

# Custom SQL query
query = """
    SELECT customer_id, order_date, total
    FROM orders
    WHERE order_date >= '2024-01-01'
    ORDER BY total DESC
"""
df = pd.read_sql_query(query, engine)

🎮 Interactive Data Loading Simulator

Watch how different file formats are loaded and processed:

🔍 Advanced Loading Techniques

💡 Memory Optimization

When dealing with large files, use these techniques to manage memory:

# Read in chunks
chunk_size = 10000
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk.groupby('category').sum()
    # Save or aggregate results

# Specify data types to reduce memory
dtypes = {
    'product_id': 'category',  # Categorical for repeated values
    'quantity': 'int32',        # Smaller integer type
    'price': 'float32'          # Single precision float
}
df = pd.read_csv('data.csv', dtype=dtypes)

# Only load specific columns
df = pd.read_csv('data.csv', usecols=['col1', 'col2', 'col3'])

⚠️ Common Pitfalls

Encoding Issues: Use encoding='utf-8' or 'latin-1' for special characters
Date Parsing: Specify date columns explicitly with parse_dates
Missing Values: Define custom NA values with na_values
Large Files: Use chunking or Dask for files that don't fit in memory

🌍 Real-World Scenario: E-commerce Data Pipeline

Imagine you're a data analyst at an e-commerce company. You need to combine data from multiple sources:

# Morning routine: Load yesterday's data from various sources

import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import create_engine

# 1. Load product catalog from Excel (updated weekly)
products = pd.read_excel('product_catalog.xlsx', 
                         sheet_name='Active_Products')

# 2. Load customer orders from CSV (daily export from order system)
yesterday = (datetime.now() - timedelta(1)).strftime('%Y%m%d')
orders = pd.read_csv(f'orders_{yesterday}.csv',
                     parse_dates=['order_timestamp'],
                     dtype={'customer_id': str})  # Keep as string to preserve leading zeros

# 3. Load customer data from SQL database
engine = create_engine('postgresql://analyst:pass@datawarehouse/ecommerce')
customers = pd.read_sql_query("""
    SELECT customer_id, segment, lifetime_value, registration_date
    FROM customers
    WHERE status = 'active'
""", engine)

# 4. Load shipping data from JSON API response
shipping = pd.read_json('https://api.shipping.com/tracking/bulk')

# 5. Combine all data sources
# Merge orders with products
order_details = orders.merge(products, on='product_id', how='left')

# Add customer information
full_data = order_details.merge(customers, on='customer_id', how='left')

# Add shipping status
final_dataset = full_data.merge(shipping, on='order_id', how='left')

print(f"Combined dataset: {final_dataset.shape[0]} rows, {final_dataset.shape[1]} columns")
print(f"Data sources integrated successfully!")

📚 Quick Reference Guide

graph TD A[Choose Read Function] --> B{File Type?} B -->|Text/CSV| C[pd.read_csv] B -->|Excel| D[pd.read_excel] B -->|JSON| E[pd.read_json] B -->|SQL| F[pd.read_sql] B -->|HTML| G[pd.read_html] C --> H[Common Parameters] D --> H E --> H F --> H G --> H H --> I[sep/delimiter] H --> J[header/names] H --> K[index_col] H --> L[usecols] H --> M[dtype] H --> N[parse_dates] H --> O[na_values] H --> P[encoding] style A fill:#f9f,stroke:#333,stroke-width:2px style B fill:#99f,stroke:#333,stroke-width:2px style H fill:#9f9,stroke:#333,stroke-width:2px

🚀 Pro Tips

Always inspect your data after loading: Use df.head(), df.info(), and df.describe()
Set index wisely: Choose a unique identifier as index for faster operations
Handle dates early: Parse dates during loading, not after
Use compression: Pandas can read compressed files directly (.gz, .zip, .bz2)
Profile your loading: Use %%time in Jupyter to measure loading performance