Skip to main content

Complex Queries and Joins

Master Advanced SQL for Data Analysis! 🎯

Complex SQL queries are the backbone of data analysis. From multi-table joins to recursive CTEs, mastering these techniques enables you to extract insights from relational databases efficiently. Learn to write queries that would make any database administrator proud!

Setting Up Our Practice Database

Let's create a comprehensive e-commerce database for our examples:

import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

# Create connection
conn = sqlite3.connect('ecommerce.db')
cursor = conn.cursor()

# Create tables
cursor.executescript('''
    -- Customers table
    CREATE TABLE IF NOT EXISTS customers (
        customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        email TEXT UNIQUE NOT NULL,
        city TEXT,
        country TEXT,
        registration_date DATE,
        customer_segment TEXT CHECK(customer_segment IN ('Regular', 'Premium', 'VIP'))
    );
    
    -- Products table
    CREATE TABLE IF NOT EXISTS products (
        product_id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_name TEXT NOT NULL,
        category TEXT,
        subcategory TEXT,
        price DECIMAL(10,2),
        cost DECIMAL(10,2),
        supplier_id INTEGER,
        stock_quantity INTEGER DEFAULT 0,
        discontinued BOOLEAN DEFAULT FALSE
    );
    
    -- Orders table
    CREATE TABLE IF NOT EXISTS orders (
        order_id INTEGER PRIMARY KEY AUTOINCREMENT,
        customer_id INTEGER,
        order_date DATETIME,
        ship_date DATETIME,
        status TEXT CHECK(status IN ('Pending', 'Processing', 'Shipped', 'Delivered', 'Cancelled')),
        total_amount DECIMAL(10,2),
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
    );
    
    -- Order details table
    CREATE TABLE IF NOT EXISTS order_details (
        order_detail_id INTEGER PRIMARY KEY AUTOINCREMENT,
        order_id INTEGER,
        product_id INTEGER,
        quantity INTEGER,
        unit_price DECIMAL(10,2),
        discount DECIMAL(3,2) DEFAULT 0,
        FOREIGN KEY (order_id) REFERENCES orders(order_id),
        FOREIGN KEY (product_id) REFERENCES products(product_id)
    );
    
    -- Suppliers table
    CREATE TABLE IF NOT EXISTS suppliers (
        supplier_id INTEGER PRIMARY KEY AUTOINCREMENT,
        supplier_name TEXT NOT NULL,
        contact_name TEXT,
        country TEXT,
        rating INTEGER CHECK(rating >= 1 AND rating <= 5)
    );
    
    -- Employee table
    CREATE TABLE IF NOT EXISTS employees (
        employee_id INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        department TEXT,
        hire_date DATE,
        salary DECIMAL(10,2),
        manager_id INTEGER,
        FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
    );
    
    -- Reviews table
    CREATE TABLE IF NOT EXISTS reviews (
        review_id INTEGER PRIMARY KEY AUTOINCREMENT,
        product_id INTEGER,
        customer_id INTEGER,
        rating INTEGER CHECK(rating >= 1 AND rating <= 5),
        review_text TEXT,
        review_date DATE,
        helpful_count INTEGER DEFAULT 0,
        FOREIGN KEY (product_id) REFERENCES products(product_id),
        FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
    );
''')

# Generate sample data
def generate_sample_data():
    # Sample data generation
    from faker import Faker
    fake = Faker()
    
    # Generate customers
    customers = []
    for _ in range(100):
        customers.append((
            fake.first_name(),
            fake.last_name(),
            fake.email(),
            fake.city(),
            fake.country(),
            fake.date_between(start_date='-2y', end_date='today'),
            random.choice(['Regular', 'Premium', 'VIP'])
        ))
    
    cursor.executemany('''
        INSERT OR IGNORE INTO customers 
        (first_name, last_name, email, city, country, registration_date, customer_segment)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    ''', customers)
    
    conn.commit()
    print("Sample database created successfully!")

generate_sample_data()

Understanding Joins

graph LR A[Join Types] --> B[INNER JOIN] A --> C[LEFT JOIN] A --> D[RIGHT JOIN] A --> E[FULL OUTER JOIN] A --> F[CROSS JOIN] A --> G[SELF JOIN] B --> H[Returns matching rows] C --> I[All from left + matches] D --> J[All from right + matches] E --> K[All from both tables] F --> L[Cartesian product] G --> M[Table joined with itself]

INNER JOIN - Finding Matches

# Basic INNER JOIN
query = """
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    c.customer_segment,
    o.order_id,
    o.order_date,
    o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 100
ORDER BY o.order_date DESC
LIMIT 10;
"""

df = pd.read_sql_query(query, conn)
print(df)

# Multiple table joins
query = """
SELECT 
    c.first_name || ' ' || c.last_name AS customer_name,
    p.product_name,
    od.quantity,
    od.unit_price,
    od.quantity * od.unit_price * (1 - od.discount) AS line_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_details od ON o.order_id = od.order_id
INNER JOIN products p ON od.product_id = p.product_id
WHERE o.status = 'Delivered'
ORDER BY line_total DESC;
"""

df = pd.read_sql_query(query, conn)
print(df.head())

LEFT JOIN - Include All from Left Table

# Find customers who haven't placed orders
query = """
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.email,
    c.registration_date,
    COUNT(o.order_id) AS order_count,
    COALESCE(SUM(o.total_amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email, c.registration_date
HAVING COUNT(o.order_id) = 0
ORDER BY c.registration_date DESC;
"""

df = pd.read_sql_query(query, conn)
print(f"Customers without orders: {len(df)}")
print(df.head())

# Products that haven't been reviewed
query = """
SELECT 
    p.product_id,
    p.product_name,
    p.category,
    p.price,
    COUNT(r.review_id) AS review_count,
    AVG(r.rating) AS avg_rating
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
WHERE p.discontinued = 0
GROUP BY p.product_id, p.product_name, p.category, p.price
HAVING COUNT(r.review_id) = 0
ORDER BY p.price DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

FULL OUTER JOIN (Simulated in SQLite)

# SQLite doesn't support FULL OUTER JOIN, so we simulate it with UNION
query = """
-- Get all combinations of customers and products
WITH customer_product_full AS (
    SELECT c.customer_id, p.product_id
    FROM customers c
    LEFT JOIN products p ON 1=1
    
    UNION
    
    SELECT c.customer_id, p.product_id
    FROM products p
    LEFT JOIN customers c ON 1=1
)
SELECT * FROM customer_product_full
LIMIT 10;
"""

# More practical: Combine LEFT and RIGHT JOIN behavior
query = """
SELECT 
    COALESCE(c.customer_id, -1) AS customer_id,
    COALESCE(c.first_name || ' ' || c.last_name, 'No Customer') AS customer_name,
    COALESCE(o.order_id, -1) AS order_id,
    CASE 
        WHEN c.customer_id IS NULL THEN 'Orphaned Order'
        WHEN o.order_id IS NULL THEN 'No Orders'
        ELSE 'Has Orders'
    END AS status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id

UNION

SELECT 
    COALESCE(c.customer_id, -1) AS customer_id,
    COALESCE(c.first_name || ' ' || c.last_name, 'No Customer') AS customer_name,
    o.order_id,
    'Orphaned Order' AS status
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
"""

df = pd.read_sql_query(query, conn)
print(df)

SELF JOIN - Hierarchical Relationships

# Employee hierarchy
query = """
SELECT 
    e1.employee_id,
    e1.first_name || ' ' || e1.last_name AS employee_name,
    e1.department,
    e1.salary,
    e2.first_name || ' ' || e2.last_name AS manager_name,
    e2.salary AS manager_salary,
    e1.salary - e2.salary AS salary_difference
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id
ORDER BY e1.department, e1.salary DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

# Find products in the same category with similar prices
query = """
SELECT 
    p1.product_name AS product_1,
    p1.price AS price_1,
    p2.product_name AS product_2,
    p2.price AS price_2,
    ABS(p1.price - p2.price) AS price_difference
FROM products p1
INNER JOIN products p2 
    ON p1.category = p2.category 
    AND p1.product_id < p2.product_id
    AND ABS(p1.price - p2.price) < 10
WHERE p1.discontinued = 0 AND p2.discontinued = 0
ORDER BY price_difference
LIMIT 20;
"""

df = pd.read_sql_query(query, conn)
print(df)

Subqueries

Scalar Subqueries

# Subquery in SELECT clause
query = """
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.customer_segment,
    (
        SELECT COUNT(*)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS order_count,
    (
        SELECT SUM(total_amount)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS total_spent,
    (
        SELECT MAX(order_date)
        FROM orders o
        WHERE o.customer_id = c.customer_id
    ) AS last_order_date
FROM customers c
ORDER BY total_spent DESC NULLS LAST
LIMIT 10;
"""

df = pd.read_sql_query(query, conn)
print(df)

Correlated Subqueries

# Find products priced above their category average
query = """
SELECT 
    p1.product_id,
    p1.product_name,
    p1.category,
    p1.price,
    (
        SELECT AVG(p2.price)
        FROM products p2
        WHERE p2.category = p1.category
    ) AS category_avg_price,
    p1.price - (
        SELECT AVG(p2.price)
        FROM products p2
        WHERE p2.category = p1.category
    ) AS price_above_avg
FROM products p1
WHERE p1.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category = p1.category
)
ORDER BY price_above_avg DESC;
"""

df = pd.read_sql_query(query, conn)
print(df.head())

# Find customers whose orders are all above average
query = """
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    COUNT(o.order_id) AS order_count,
    AVG(o.total_amount) AS avg_order_value
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o2
    WHERE o2.customer_id = c.customer_id
    AND o2.total_amount < (
        SELECT AVG(total_amount)
        FROM orders
    )
)
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(o.order_id) > 2
ORDER BY avg_order_value DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

Subqueries in FROM Clause

# Derived tables
query = """
SELECT 
    customer_segment,
    AVG(order_count) AS avg_orders_per_customer,
    AVG(total_spent) AS avg_spent_per_customer,
    MAX(total_spent) AS max_spent,
    MIN(total_spent) AS min_spent
FROM (
    SELECT 
        c.customer_segment,
        c.customer_id,
        COUNT(o.order_id) AS order_count,
        COALESCE(SUM(o.total_amount), 0) AS total_spent
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_segment, c.customer_id
) AS customer_summary
GROUP BY customer_segment
ORDER BY avg_spent_per_customer DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

Common Table Expressions (CTEs)

Basic CTEs

# Single CTE
query = """
WITH customer_metrics AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        c.customer_segment,
        COUNT(DISTINCT o.order_id) AS order_count,
        SUM(o.total_amount) AS total_spent,
        AVG(o.total_amount) AS avg_order_value,
        MAX(o.order_date) AS last_order_date
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.first_name, c.last_name, c.customer_segment
)
SELECT 
    customer_segment,
    COUNT(*) AS customer_count,
    AVG(order_count) AS avg_orders,
    AVG(total_spent) AS avg_total_spent,
    AVG(avg_order_value) AS avg_order_value
FROM customer_metrics
GROUP BY customer_segment
ORDER BY avg_total_spent DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

Multiple CTEs

# Chain multiple CTEs for complex analysis
query = """
WITH 
-- Calculate product performance
product_performance AS (
    SELECT 
        p.product_id,
        p.product_name,
        p.category,
        p.price,
        COUNT(DISTINCT od.order_id) AS times_ordered,
        SUM(od.quantity) AS total_quantity_sold,
        SUM(od.quantity * od.unit_price * (1 - od.discount)) AS total_revenue
    FROM products p
    LEFT JOIN order_details od ON p.product_id = od.product_id
    GROUP BY p.product_id, p.product_name, p.category, p.price
),
-- Calculate category rankings
category_rankings AS (
    SELECT 
        product_id,
        product_name,
        category,
        total_revenue,
        ROW_NUMBER() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS category_rank,
        DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS overall_rank
    FROM product_performance
    WHERE total_revenue > 0
),
-- Get top performers
top_performers AS (
    SELECT *
    FROM category_rankings
    WHERE category_rank <= 3
)
SELECT 
    tp.*,
    pp.times_ordered,
    pp.total_quantity_sold,
    ROUND(pp.total_revenue / pp.total_quantity_sold, 2) AS avg_selling_price
FROM top_performers tp
JOIN product_performance pp ON tp.product_id = pp.product_id
ORDER BY tp.category, tp.category_rank;
"""

df = pd.read_sql_query(query, conn)
print(df)

Recursive CTEs

# Create a category hierarchy table
cursor.execute('''
    CREATE TABLE IF NOT EXISTS category_hierarchy (
        category_id INTEGER PRIMARY KEY,
        category_name TEXT,
        parent_category_id INTEGER,
        FOREIGN KEY (parent_category_id) REFERENCES category_hierarchy(category_id)
    );
''')

# Recursive CTE for hierarchy traversal
query = """
WITH RECURSIVE category_tree AS (
    -- Anchor: Start with root categories
    SELECT 
        category_id,
        category_name,
        parent_category_id,
        0 AS level,
        category_name AS path
    FROM category_hierarchy
    WHERE parent_category_id IS NULL
    
    UNION ALL
    
    -- Recursive: Add child categories
    SELECT 
        ch.category_id,
        ch.category_name,
        ch.parent_category_id,
        ct.level + 1,
        ct.path || ' > ' || ch.category_name
    FROM category_hierarchy ch
    INNER JOIN category_tree ct ON ch.parent_category_id = ct.category_id
    WHERE ct.level < 10  -- Prevent infinite recursion
)
SELECT 
    category_id,
    category_name,
    level,
    path
FROM category_tree
ORDER BY path;
"""

# Generate numbers sequence using recursive CTE
query = """
WITH RECURSIVE numbers(n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 100
)
SELECT 
    n,
    n * n AS square,
    n * n * n AS cube
FROM numbers
WHERE n <= 10;
"""

df = pd.read_sql_query(query, conn)
print(df)

Advanced Join Techniques

Non-Equi Joins

# Date range joins
query = """
-- Find overlapping promotions
WITH promotions AS (
    SELECT 
        1 AS promo_id, 
        'Summer Sale' AS promo_name,
        DATE('2024-06-01') AS start_date,
        DATE('2024-08-31') AS end_date
    UNION ALL
    SELECT 2, 'Back to School', DATE('2024-08-15'), DATE('2024-09-15')
    UNION ALL
    SELECT 3, 'Fall Special', DATE('2024-09-01'), DATE('2024-11-30')
)
SELECT 
    p1.promo_name AS promo_1,
    p1.start_date AS promo_1_start,
    p1.end_date AS promo_1_end,
    p2.promo_name AS promo_2,
    p2.start_date AS promo_2_start,
    p2.end_date AS promo_2_end,
    -- Calculate overlap days
    MIN(p1.end_date, p2.end_date) AS overlap_end,
    MAX(p1.start_date, p2.start_date) AS overlap_start
FROM promotions p1
INNER JOIN promotions p2 
    ON p1.promo_id < p2.promo_id
    AND p1.start_date <= p2.end_date
    AND p2.start_date <= p1.end_date;
"""

df = pd.read_sql_query(query, conn)
print(df)

Complex Join Conditions

# Multi-condition joins with CASE
query = """
SELECT 
    c.customer_id,
    c.first_name || ' ' || c.last_name AS customer_name,
    c.customer_segment,
    o.order_id,
    o.total_amount,
    CASE 
        WHEN c.customer_segment = 'VIP' AND o.total_amount > 500 THEN 0.20
        WHEN c.customer_segment = 'Premium' AND o.total_amount > 300 THEN 0.15
        WHEN c.customer_segment = 'Regular' AND o.total_amount > 100 THEN 0.10
        ELSE 0
    END AS discount_rate,
    o.total_amount * (1 - CASE 
        WHEN c.customer_segment = 'VIP' AND o.total_amount > 500 THEN 0.20
        WHEN c.customer_segment = 'Premium' AND o.total_amount > 300 THEN 0.15
        WHEN c.customer_segment = 'Regular' AND o.total_amount > 100 THEN 0.10
        ELSE 0
    END) AS final_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status != 'Cancelled'
ORDER BY final_amount DESC
LIMIT 20;
"""

df = pd.read_sql_query(query, conn)
print(df)

Set Operations

UNION, INTERSECT, and EXCEPT

# UNION - Combine customer and supplier contacts
query = """
SELECT 
    'Customer' AS contact_type,
    first_name || ' ' || last_name AS contact_name,
    email AS contact_info,
    city,
    country
FROM customers
WHERE country = 'USA'

UNION

SELECT 
    'Supplier' AS contact_type,
    contact_name,
    supplier_name AS contact_info,
    NULL AS city,
    country
FROM suppliers
WHERE country = 'USA'

ORDER BY contact_type, contact_name;
"""

df = pd.read_sql_query(query, conn)
print(df.head())

# INTERSECT - Find common elements
query = """
-- Customers who are also employees (by name)
SELECT first_name, last_name
FROM customers

INTERSECT

SELECT first_name, last_name
FROM employees;
"""

df = pd.read_sql_query(query, conn)
print(df)

# EXCEPT - Find differences
query = """
-- Products never ordered
SELECT product_id
FROM products
WHERE discontinued = 0

EXCEPT

SELECT DISTINCT product_id
FROM order_details;
"""

df = pd.read_sql_query(query, conn)
print(df)

Advanced Aggregations

GROUPING SETS, ROLLUP, and CUBE

# Simulate GROUPING SETS with UNION
query = """
-- Multiple grouping levels in one query
SELECT 
    category,
    NULL AS subcategory,
    COUNT(*) AS product_count,
    AVG(price) AS avg_price,
    'Category Level' AS grouping_level
FROM products
GROUP BY category

UNION ALL

SELECT 
    category,
    subcategory,
    COUNT(*) AS product_count,
    AVG(price) AS avg_price,
    'Subcategory Level' AS grouping_level
FROM products
GROUP BY category, subcategory

UNION ALL

SELECT 
    NULL AS category,
    NULL AS subcategory,
    COUNT(*) AS product_count,
    AVG(price) AS avg_price,
    'Grand Total' AS grouping_level
FROM products

ORDER BY 
    CASE WHEN category IS NULL THEN 1 ELSE 0 END,
    category,
    CASE WHEN subcategory IS NULL THEN 1 ELSE 0 END,
    subcategory;
"""

df = pd.read_sql_query(query, conn)
print(df)

Query Optimization Techniques

Using EXPLAIN QUERY PLAN

# Analyze query execution plan
def analyze_query(query):
    explain_query = f"EXPLAIN QUERY PLAN {query}"
    df = pd.read_sql_query(explain_query, conn)
    print("Query Execution Plan:")
    print(df)
    print("\nActual Query Results:")
    return pd.read_sql_query(query, conn)

# Example: Comparing join strategies
query1 = """
SELECT c.*, o.*
FROM customers c, orders o
WHERE c.customer_id = o.customer_id
AND c.country = 'USA'
"""

query2 = """
SELECT c.*, o.*
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
"""

print("Query 1 - Old style join:")
df1 = analyze_query(query1)
print("\nQuery 2 - Modern join:")
df2 = analyze_query(query2)

Index Usage

# Create indexes for better performance
cursor.executescript('''
    -- Create indexes for foreign keys
    CREATE INDEX IF NOT EXISTS idx_orders_customer_id 
    ON orders(customer_id);
    
    CREATE INDEX IF NOT EXISTS idx_order_details_order_id 
    ON order_details(order_id);
    
    CREATE INDEX IF NOT EXISTS idx_order_details_product_id 
    ON order_details(product_id);
    
    -- Create composite indexes for common queries
    CREATE INDEX IF NOT EXISTS idx_orders_status_date 
    ON orders(status, order_date);
    
    CREATE INDEX IF NOT EXISTS idx_products_category_price 
    ON products(category, price);
    
    -- Create covering index
    CREATE INDEX IF NOT EXISTS idx_customers_covering 
    ON customers(customer_id, first_name, last_name, email);
''')

# Query using indexes
query = """
SELECT 
    c.first_name,
    c.last_name,
    c.email,
    COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'Delivered'
    AND o.order_date >= DATE('now', '-30 days')
GROUP BY c.customer_id, c.first_name, c.last_name, c.email
HAVING COUNT(o.order_id) > 0
ORDER BY order_count DESC;
"""

df = analyze_query(query)
print(df.head())

Real-world Complex Query Examples

Customer Lifetime Value Analysis

query = """
WITH customer_metrics AS (
    SELECT 
        c.customer_id,
        c.first_name || ' ' || c.last_name AS customer_name,
        c.customer_segment,
        c.registration_date,
        COUNT(DISTINCT o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_spent,
        AVG(o.total_amount) AS avg_order_value,
        MIN(o.order_date) AS first_order_date,
        MAX(o.order_date) AS last_order_date,
        JULIANDAY(MAX(o.order_date)) - JULIANDAY(MIN(o.order_date)) AS customer_lifespan_days,
        JULIANDAY('now') - JULIANDAY(MAX(o.order_date)) AS days_since_last_order
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.status != 'Cancelled'
    GROUP BY c.customer_id, c.first_name, c.last_name, c.customer_segment, c.registration_date
),
customer_segments AS (
    SELECT 
        *,
        CASE 
            WHEN days_since_last_order <= 30 THEN 'Active'
            WHEN days_since_last_order <= 90 THEN 'At Risk'
            WHEN days_since_last_order <= 180 THEN 'Dormant'
            ELSE 'Lost'
        END AS activity_status,
        CASE 
            WHEN total_spent > 10000 THEN 'High Value'
            WHEN total_spent > 5000 THEN 'Medium Value'
            WHEN total_spent > 1000 THEN 'Low Value'
            ELSE 'Minimal Value'
        END AS value_segment,
        CASE 
            WHEN customer_lifespan_days = 0 THEN 0
            ELSE total_spent / (customer_lifespan_days / 365.0)
        END AS annual_value
    FROM customer_metrics
)
SELECT 
    activity_status,
    value_segment,
    COUNT(*) AS customer_count,
    AVG(total_orders) AS avg_orders,
    AVG(total_spent) AS avg_lifetime_value,
    AVG(annual_value) AS avg_annual_value,
    SUM(total_spent) AS total_revenue
FROM customer_segments
GROUP BY activity_status, value_segment
ORDER BY activity_status, total_revenue DESC;
"""

df = pd.read_sql_query(query, conn)
print(df)

Market Basket Analysis

query = """
WITH product_pairs AS (
    SELECT 
        od1.product_id AS product_1,
        od2.product_id AS product_2,
        COUNT(DISTINCT od1.order_id) AS times_bought_together
    FROM order_details od1
    INNER JOIN order_details od2 
        ON od1.order_id = od2.order_id
        AND od1.product_id < od2.product_id
    GROUP BY od1.product_id, od2.product_id
    HAVING COUNT(DISTINCT od1.order_id) >= 5
),
product_stats AS (
    SELECT 
        product_id,
        COUNT(DISTINCT order_id) AS total_orders
    FROM order_details
    GROUP BY product_id
)
SELECT 
    p1.product_name AS product_1_name,
    p2.product_name AS product_2_name,
    pp.times_bought_together,
    ps1.total_orders AS product_1_orders,
    ps2.total_orders AS product_2_orders,
    ROUND(
        CAST(pp.times_bought_together AS FLOAT) / ps1.total_orders * 100, 
        2
    ) AS confidence_1_to_2,
    ROUND(
        CAST(pp.times_bought_together AS FLOAT) / ps2.total_orders * 100, 
        2
    ) AS confidence_2_to_1,
    ROUND(
        CAST(pp.times_bought_together AS FLOAT) / 
        (ps1.total_orders + ps2.total_orders - pp.times_bought_together) * 100,
        2
    ) AS lift
FROM product_pairs pp
JOIN products p1 ON pp.product_1 = p1.product_id
JOIN products p2 ON pp.product_2 = p2.product_id
JOIN product_stats ps1 ON pp.product_1 = ps1.product_id
JOIN product_stats ps2 ON pp.product_2 = ps2.product_id
ORDER BY pp.times_bought_together DESC
LIMIT 20;
"""

df = pd.read_sql_query(query, conn)
print(df)

Practice Exercises

Exercise 1: Sales Performance Dashboard

Create a comprehensive query that shows:

  1. Monthly sales trends with year-over-year comparison
  2. Top performing products by category
  3. Customer segment analysis
  4. Geographic sales distribution
  5. All in a single query using CTEs

Exercise 2: Customer Churn Analysis

Build a query to identify:

  1. Customers likely to churn (no orders in 90 days)
  2. Their historical purchase patterns
  3. Common characteristics of churned customers
  4. Recommendations for retention

Exercise 3: Inventory Optimization

Design a query that:

  1. Identifies slow-moving products
  2. Calculates optimal reorder points
  3. Suggests bundle opportunities
  4. Forecasts demand based on historical data

Key Takeaways

Further Resources