Master the art of importing data from various sources into pandas DataFrames
Think of pandas as a universal translator for data formats. Just like a skilled interpreter who can understand multiple languages and convert between them seamlessly, pandas can read data from virtually any source - CSV files, Excel spreadsheets, JSON documents, SQL databases, and more!
In the real world, data rarely comes in a single, perfect format. You might receive sales data in Excel, customer feedback in CSV, API responses in JSON, and transaction records in a SQL database. Pandas is your Swiss Army knife for bringing all this data together.
The bread and butter of data exchange - simple, universal, and human-readable
# Basic CSV loading
df = pd.read_csv('sales_data.csv')
# With custom settings
df = pd.read_csv('data.csv',
sep=';', # European CSV with semicolon
encoding='utf-8', # Handle special characters
parse_dates=['date_column'], # Auto-parse dates
index_col='id' # Use 'id' column as index
)
Corporate data's favorite format - multiple sheets, formatting, and formulas
# Read specific sheet
df = pd.read_excel('report.xlsx', sheet_name='Q1_Sales')
# Read multiple sheets
all_sheets = pd.read_excel('report.xlsx', sheet_name=None)
# Returns dictionary: {'Sheet1': df1, 'Sheet2': df2, ...}
# Skip rows and use specific columns
df = pd.read_excel('data.xlsx',
skiprows=2, # Skip header rows
usecols='A:E', # Only columns A to E
na_values=['N/A', 'Missing'] # Custom NA values
)
The language of web APIs - nested, hierarchical, and flexible
# Simple JSON
df = pd.read_json('api_response.json')
# Nested JSON with normalization
import json
with open('nested_data.json') as f:
data = json.load(f)
df = pd.json_normalize(data,
record_path=['results'], # Path to records
meta=['timestamp', 'version'] # Include metadata
)
# From API endpoint
df = pd.read_json('https://api.example.com/data')
Enterprise data warehouses - structured, relational, and massive
# Using SQLAlchemy
from sqlalchemy import create_engine
# Connect to database
engine = create_engine('sqlite:///database.db')
# Or: 'postgresql://user:pass@host/db'
# Or: 'mysql://user:pass@host/db'
# Read entire table
df = pd.read_sql_table('customers', engine)
# Custom SQL query
query = """
SELECT customer_id, order_date, total
FROM orders
WHERE order_date >= '2024-01-01'
ORDER BY total DESC
"""
df = pd.read_sql_query(query, engine)
Watch how different file formats are loaded and processed:
When dealing with large files, use these techniques to manage memory:
# Read in chunks
chunk_size = 10000
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
# Process each chunk
processed = chunk.groupby('category').sum()
# Save or aggregate results
# Specify data types to reduce memory
dtypes = {
'product_id': 'category', # Categorical for repeated values
'quantity': 'int32', # Smaller integer type
'price': 'float32' # Single precision float
}
df = pd.read_csv('data.csv', dtype=dtypes)
# Only load specific columns
df = pd.read_csv('data.csv', usecols=['col1', 'col2', 'col3'])
encoding='utf-8' or 'latin-1' for special charactersparse_datesna_valuesImagine you're a data analyst at an e-commerce company. You need to combine data from multiple sources:
# Morning routine: Load yesterday's data from various sources
import pandas as pd
from datetime import datetime, timedelta
from sqlalchemy import create_engine
# 1. Load product catalog from Excel (updated weekly)
products = pd.read_excel('product_catalog.xlsx',
sheet_name='Active_Products')
# 2. Load customer orders from CSV (daily export from order system)
yesterday = (datetime.now() - timedelta(1)).strftime('%Y%m%d')
orders = pd.read_csv(f'orders_{yesterday}.csv',
parse_dates=['order_timestamp'],
dtype={'customer_id': str}) # Keep as string to preserve leading zeros
# 3. Load customer data from SQL database
engine = create_engine('postgresql://analyst:pass@datawarehouse/ecommerce')
customers = pd.read_sql_query("""
SELECT customer_id, segment, lifetime_value, registration_date
FROM customers
WHERE status = 'active'
""", engine)
# 4. Load shipping data from JSON API response
shipping = pd.read_json('https://api.shipping.com/tracking/bulk')
# 5. Combine all data sources
# Merge orders with products
order_details = orders.merge(products, on='product_id', how='left')
# Add customer information
full_data = order_details.merge(customers, on='customer_id', how='left')
# Add shipping status
final_dataset = full_data.merge(shipping, on='order_id', how='left')
print(f"Combined dataset: {final_dataset.shape[0]} rows, {final_dataset.shape[1]} columns")
print(f"Data sources integrated successfully!")
df.head(), df.info(), and df.describe().gz, .zip, .bz2)%%time in Jupyter to measure loading performance