Complex Queries and Joins
Master Advanced SQL for Data Analysis! 🎯
Complex SQL queries are the backbone of data analysis. From multi-table joins to recursive CTEs, mastering these techniques enables you to extract insights from relational databases efficiently. Learn to write queries that would make any database administrator proud!
Setting Up Our Practice Database
Let's create a comprehensive e-commerce database for our examples:
import sqlite3
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random
# Create connection
conn = sqlite3.connect('ecommerce.db')
cursor = conn.cursor()
# Create tables
cursor.executescript('''
-- Customers table
CREATE TABLE IF NOT EXISTS customers (
customer_id INTEGER PRIMARY KEY AUTOINCREMENT,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
email TEXT UNIQUE NOT NULL,
city TEXT,
country TEXT,
registration_date DATE,
customer_segment TEXT CHECK(customer_segment IN ('Regular', 'Premium', 'VIP'))
);
-- Products table
CREATE TABLE IF NOT EXISTS products (
product_id INTEGER PRIMARY KEY AUTOINCREMENT,
product_name TEXT NOT NULL,
category TEXT,
subcategory TEXT,
price DECIMAL(10,2),
cost DECIMAL(10,2),
supplier_id INTEGER,
stock_quantity INTEGER DEFAULT 0,
discontinued BOOLEAN DEFAULT FALSE
);
-- Orders table
CREATE TABLE IF NOT EXISTS orders (
order_id INTEGER PRIMARY KEY AUTOINCREMENT,
customer_id INTEGER,
order_date DATETIME,
ship_date DATETIME,
status TEXT CHECK(status IN ('Pending', 'Processing', 'Shipped', 'Delivered', 'Cancelled')),
total_amount DECIMAL(10,2),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
-- Order details table
CREATE TABLE IF NOT EXISTS order_details (
order_detail_id INTEGER PRIMARY KEY AUTOINCREMENT,
order_id INTEGER,
product_id INTEGER,
quantity INTEGER,
unit_price DECIMAL(10,2),
discount DECIMAL(3,2) DEFAULT 0,
FOREIGN KEY (order_id) REFERENCES orders(order_id),
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
-- Suppliers table
CREATE TABLE IF NOT EXISTS suppliers (
supplier_id INTEGER PRIMARY KEY AUTOINCREMENT,
supplier_name TEXT NOT NULL,
contact_name TEXT,
country TEXT,
rating INTEGER CHECK(rating >= 1 AND rating <= 5)
);
-- Employee table
CREATE TABLE IF NOT EXISTS employees (
employee_id INTEGER PRIMARY KEY AUTOINCREMENT,
first_name TEXT NOT NULL,
last_name TEXT NOT NULL,
department TEXT,
hire_date DATE,
salary DECIMAL(10,2),
manager_id INTEGER,
FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
);
-- Reviews table
CREATE TABLE IF NOT EXISTS reviews (
review_id INTEGER PRIMARY KEY AUTOINCREMENT,
product_id INTEGER,
customer_id INTEGER,
rating INTEGER CHECK(rating >= 1 AND rating <= 5),
review_text TEXT,
review_date DATE,
helpful_count INTEGER DEFAULT 0,
FOREIGN KEY (product_id) REFERENCES products(product_id),
FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
''')
# Generate sample data
def generate_sample_data():
# Sample data generation
from faker import Faker
fake = Faker()
# Generate customers
customers = []
for _ in range(100):
customers.append((
fake.first_name(),
fake.last_name(),
fake.email(),
fake.city(),
fake.country(),
fake.date_between(start_date='-2y', end_date='today'),
random.choice(['Regular', 'Premium', 'VIP'])
))
cursor.executemany('''
INSERT OR IGNORE INTO customers
(first_name, last_name, email, city, country, registration_date, customer_segment)
VALUES (?, ?, ?, ?, ?, ?, ?)
''', customers)
conn.commit()
print("Sample database created successfully!")
generate_sample_data()
Understanding Joins
graph LR
A[Join Types] --> B[INNER JOIN]
A --> C[LEFT JOIN]
A --> D[RIGHT JOIN]
A --> E[FULL OUTER JOIN]
A --> F[CROSS JOIN]
A --> G[SELF JOIN]
B --> H[Returns matching rows]
C --> I[All from left + matches]
D --> J[All from right + matches]
E --> K[All from both tables]
F --> L[Cartesian product]
G --> M[Table joined with itself]
INNER JOIN - Finding Matches
# Basic INNER JOIN
query = """
SELECT
c.first_name || ' ' || c.last_name AS customer_name,
c.customer_segment,
o.order_id,
o.order_date,
o.total_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.total_amount > 100
ORDER BY o.order_date DESC
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
print(df)
# Multiple table joins
query = """
SELECT
c.first_name || ' ' || c.last_name AS customer_name,
p.product_name,
od.quantity,
od.unit_price,
od.quantity * od.unit_price * (1 - od.discount) AS line_total
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN order_details od ON o.order_id = od.order_id
INNER JOIN products p ON od.product_id = p.product_id
WHERE o.status = 'Delivered'
ORDER BY line_total DESC;
"""
df = pd.read_sql_query(query, conn)
print(df.head())
LEFT JOIN - Include All from Left Table
# Find customers who haven't placed orders
query = """
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.email,
c.registration_date,
COUNT(o.order_id) AS order_count,
COALESCE(SUM(o.total_amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.email, c.registration_date
HAVING COUNT(o.order_id) = 0
ORDER BY c.registration_date DESC;
"""
df = pd.read_sql_query(query, conn)
print(f"Customers without orders: {len(df)}")
print(df.head())
# Products that haven't been reviewed
query = """
SELECT
p.product_id,
p.product_name,
p.category,
p.price,
COUNT(r.review_id) AS review_count,
AVG(r.rating) AS avg_rating
FROM products p
LEFT JOIN reviews r ON p.product_id = r.product_id
WHERE p.discontinued = 0
GROUP BY p.product_id, p.product_name, p.category, p.price
HAVING COUNT(r.review_id) = 0
ORDER BY p.price DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
FULL OUTER JOIN (Simulated in SQLite)
# SQLite doesn't support FULL OUTER JOIN, so we simulate it with UNION
query = """
-- Get all combinations of customers and products
WITH customer_product_full AS (
SELECT c.customer_id, p.product_id
FROM customers c
LEFT JOIN products p ON 1=1
UNION
SELECT c.customer_id, p.product_id
FROM products p
LEFT JOIN customers c ON 1=1
)
SELECT * FROM customer_product_full
LIMIT 10;
"""
# More practical: Combine LEFT and RIGHT JOIN behavior
query = """
SELECT
COALESCE(c.customer_id, -1) AS customer_id,
COALESCE(c.first_name || ' ' || c.last_name, 'No Customer') AS customer_name,
COALESCE(o.order_id, -1) AS order_id,
CASE
WHEN c.customer_id IS NULL THEN 'Orphaned Order'
WHEN o.order_id IS NULL THEN 'No Orders'
ELSE 'Has Orders'
END AS status
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
UNION
SELECT
COALESCE(c.customer_id, -1) AS customer_id,
COALESCE(c.first_name || ' ' || c.last_name, 'No Customer') AS customer_name,
o.order_id,
'Orphaned Order' AS status
FROM orders o
LEFT JOIN customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
"""
df = pd.read_sql_query(query, conn)
print(df)
SELF JOIN - Hierarchical Relationships
# Employee hierarchy
query = """
SELECT
e1.employee_id,
e1.first_name || ' ' || e1.last_name AS employee_name,
e1.department,
e1.salary,
e2.first_name || ' ' || e2.last_name AS manager_name,
e2.salary AS manager_salary,
e1.salary - e2.salary AS salary_difference
FROM employees e1
LEFT JOIN employees e2 ON e1.manager_id = e2.employee_id
ORDER BY e1.department, e1.salary DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
# Find products in the same category with similar prices
query = """
SELECT
p1.product_name AS product_1,
p1.price AS price_1,
p2.product_name AS product_2,
p2.price AS price_2,
ABS(p1.price - p2.price) AS price_difference
FROM products p1
INNER JOIN products p2
ON p1.category = p2.category
AND p1.product_id < p2.product_id
AND ABS(p1.price - p2.price) < 10
WHERE p1.discontinued = 0 AND p2.discontinued = 0
ORDER BY price_difference
LIMIT 20;
"""
df = pd.read_sql_query(query, conn)
print(df)
Subqueries
Scalar Subqueries
# Subquery in SELECT clause
query = """
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.customer_segment,
(
SELECT COUNT(*)
FROM orders o
WHERE o.customer_id = c.customer_id
) AS order_count,
(
SELECT SUM(total_amount)
FROM orders o
WHERE o.customer_id = c.customer_id
) AS total_spent,
(
SELECT MAX(order_date)
FROM orders o
WHERE o.customer_id = c.customer_id
) AS last_order_date
FROM customers c
ORDER BY total_spent DESC NULLS LAST
LIMIT 10;
"""
df = pd.read_sql_query(query, conn)
print(df)
Correlated Subqueries
# Find products priced above their category average
query = """
SELECT
p1.product_id,
p1.product_name,
p1.category,
p1.price,
(
SELECT AVG(p2.price)
FROM products p2
WHERE p2.category = p1.category
) AS category_avg_price,
p1.price - (
SELECT AVG(p2.price)
FROM products p2
WHERE p2.category = p1.category
) AS price_above_avg
FROM products p1
WHERE p1.price > (
SELECT AVG(p2.price)
FROM products p2
WHERE p2.category = p1.category
)
ORDER BY price_above_avg DESC;
"""
df = pd.read_sql_query(query, conn)
print(df.head())
# Find customers whose orders are all above average
query = """
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
COUNT(o.order_id) AS order_count,
AVG(o.total_amount) AS avg_order_value
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE NOT EXISTS (
SELECT 1
FROM orders o2
WHERE o2.customer_id = c.customer_id
AND o2.total_amount < (
SELECT AVG(total_amount)
FROM orders
)
)
GROUP BY c.customer_id, c.first_name, c.last_name
HAVING COUNT(o.order_id) > 2
ORDER BY avg_order_value DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
Subqueries in FROM Clause
# Derived tables
query = """
SELECT
customer_segment,
AVG(order_count) AS avg_orders_per_customer,
AVG(total_spent) AS avg_spent_per_customer,
MAX(total_spent) AS max_spent,
MIN(total_spent) AS min_spent
FROM (
SELECT
c.customer_segment,
c.customer_id,
COUNT(o.order_id) AS order_count,
COALESCE(SUM(o.total_amount), 0) AS total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_segment, c.customer_id
) AS customer_summary
GROUP BY customer_segment
ORDER BY avg_spent_per_customer DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
Common Table Expressions (CTEs)
Basic CTEs
# Single CTE
query = """
WITH customer_metrics AS (
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.customer_segment,
COUNT(DISTINCT o.order_id) AS order_count,
SUM(o.total_amount) AS total_spent,
AVG(o.total_amount) AS avg_order_value,
MAX(o.order_date) AS last_order_date
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.first_name, c.last_name, c.customer_segment
)
SELECT
customer_segment,
COUNT(*) AS customer_count,
AVG(order_count) AS avg_orders,
AVG(total_spent) AS avg_total_spent,
AVG(avg_order_value) AS avg_order_value
FROM customer_metrics
GROUP BY customer_segment
ORDER BY avg_total_spent DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
Multiple CTEs
# Chain multiple CTEs for complex analysis
query = """
WITH
-- Calculate product performance
product_performance AS (
SELECT
p.product_id,
p.product_name,
p.category,
p.price,
COUNT(DISTINCT od.order_id) AS times_ordered,
SUM(od.quantity) AS total_quantity_sold,
SUM(od.quantity * od.unit_price * (1 - od.discount)) AS total_revenue
FROM products p
LEFT JOIN order_details od ON p.product_id = od.product_id
GROUP BY p.product_id, p.product_name, p.category, p.price
),
-- Calculate category rankings
category_rankings AS (
SELECT
product_id,
product_name,
category,
total_revenue,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY total_revenue DESC) AS category_rank,
DENSE_RANK() OVER (ORDER BY total_revenue DESC) AS overall_rank
FROM product_performance
WHERE total_revenue > 0
),
-- Get top performers
top_performers AS (
SELECT *
FROM category_rankings
WHERE category_rank <= 3
)
SELECT
tp.*,
pp.times_ordered,
pp.total_quantity_sold,
ROUND(pp.total_revenue / pp.total_quantity_sold, 2) AS avg_selling_price
FROM top_performers tp
JOIN product_performance pp ON tp.product_id = pp.product_id
ORDER BY tp.category, tp.category_rank;
"""
df = pd.read_sql_query(query, conn)
print(df)
Recursive CTEs
# Create a category hierarchy table
cursor.execute('''
CREATE TABLE IF NOT EXISTS category_hierarchy (
category_id INTEGER PRIMARY KEY,
category_name TEXT,
parent_category_id INTEGER,
FOREIGN KEY (parent_category_id) REFERENCES category_hierarchy(category_id)
);
''')
# Recursive CTE for hierarchy traversal
query = """
WITH RECURSIVE category_tree AS (
-- Anchor: Start with root categories
SELECT
category_id,
category_name,
parent_category_id,
0 AS level,
category_name AS path
FROM category_hierarchy
WHERE parent_category_id IS NULL
UNION ALL
-- Recursive: Add child categories
SELECT
ch.category_id,
ch.category_name,
ch.parent_category_id,
ct.level + 1,
ct.path || ' > ' || ch.category_name
FROM category_hierarchy ch
INNER JOIN category_tree ct ON ch.parent_category_id = ct.category_id
WHERE ct.level < 10 -- Prevent infinite recursion
)
SELECT
category_id,
category_name,
level,
path
FROM category_tree
ORDER BY path;
"""
# Generate numbers sequence using recursive CTE
query = """
WITH RECURSIVE numbers(n) AS (
SELECT 1
UNION ALL
SELECT n + 1
FROM numbers
WHERE n < 100
)
SELECT
n,
n * n AS square,
n * n * n AS cube
FROM numbers
WHERE n <= 10;
"""
df = pd.read_sql_query(query, conn)
print(df)
Advanced Join Techniques
Non-Equi Joins
# Date range joins
query = """
-- Find overlapping promotions
WITH promotions AS (
SELECT
1 AS promo_id,
'Summer Sale' AS promo_name,
DATE('2024-06-01') AS start_date,
DATE('2024-08-31') AS end_date
UNION ALL
SELECT 2, 'Back to School', DATE('2024-08-15'), DATE('2024-09-15')
UNION ALL
SELECT 3, 'Fall Special', DATE('2024-09-01'), DATE('2024-11-30')
)
SELECT
p1.promo_name AS promo_1,
p1.start_date AS promo_1_start,
p1.end_date AS promo_1_end,
p2.promo_name AS promo_2,
p2.start_date AS promo_2_start,
p2.end_date AS promo_2_end,
-- Calculate overlap days
MIN(p1.end_date, p2.end_date) AS overlap_end,
MAX(p1.start_date, p2.start_date) AS overlap_start
FROM promotions p1
INNER JOIN promotions p2
ON p1.promo_id < p2.promo_id
AND p1.start_date <= p2.end_date
AND p2.start_date <= p1.end_date;
"""
df = pd.read_sql_query(query, conn)
print(df)
Complex Join Conditions
# Multi-condition joins with CASE
query = """
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.customer_segment,
o.order_id,
o.total_amount,
CASE
WHEN c.customer_segment = 'VIP' AND o.total_amount > 500 THEN 0.20
WHEN c.customer_segment = 'Premium' AND o.total_amount > 300 THEN 0.15
WHEN c.customer_segment = 'Regular' AND o.total_amount > 100 THEN 0.10
ELSE 0
END AS discount_rate,
o.total_amount * (1 - CASE
WHEN c.customer_segment = 'VIP' AND o.total_amount > 500 THEN 0.20
WHEN c.customer_segment = 'Premium' AND o.total_amount > 300 THEN 0.15
WHEN c.customer_segment = 'Regular' AND o.total_amount > 100 THEN 0.10
ELSE 0
END) AS final_amount
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status != 'Cancelled'
ORDER BY final_amount DESC
LIMIT 20;
"""
df = pd.read_sql_query(query, conn)
print(df)
Set Operations
UNION, INTERSECT, and EXCEPT
# UNION - Combine customer and supplier contacts
query = """
SELECT
'Customer' AS contact_type,
first_name || ' ' || last_name AS contact_name,
email AS contact_info,
city,
country
FROM customers
WHERE country = 'USA'
UNION
SELECT
'Supplier' AS contact_type,
contact_name,
supplier_name AS contact_info,
NULL AS city,
country
FROM suppliers
WHERE country = 'USA'
ORDER BY contact_type, contact_name;
"""
df = pd.read_sql_query(query, conn)
print(df.head())
# INTERSECT - Find common elements
query = """
-- Customers who are also employees (by name)
SELECT first_name, last_name
FROM customers
INTERSECT
SELECT first_name, last_name
FROM employees;
"""
df = pd.read_sql_query(query, conn)
print(df)
# EXCEPT - Find differences
query = """
-- Products never ordered
SELECT product_id
FROM products
WHERE discontinued = 0
EXCEPT
SELECT DISTINCT product_id
FROM order_details;
"""
df = pd.read_sql_query(query, conn)
print(df)
Advanced Aggregations
GROUPING SETS, ROLLUP, and CUBE
# Simulate GROUPING SETS with UNION
query = """
-- Multiple grouping levels in one query
SELECT
category,
NULL AS subcategory,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
'Category Level' AS grouping_level
FROM products
GROUP BY category
UNION ALL
SELECT
category,
subcategory,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
'Subcategory Level' AS grouping_level
FROM products
GROUP BY category, subcategory
UNION ALL
SELECT
NULL AS category,
NULL AS subcategory,
COUNT(*) AS product_count,
AVG(price) AS avg_price,
'Grand Total' AS grouping_level
FROM products
ORDER BY
CASE WHEN category IS NULL THEN 1 ELSE 0 END,
category,
CASE WHEN subcategory IS NULL THEN 1 ELSE 0 END,
subcategory;
"""
df = pd.read_sql_query(query, conn)
print(df)
Query Optimization Techniques
Using EXPLAIN QUERY PLAN
# Analyze query execution plan
def analyze_query(query):
explain_query = f"EXPLAIN QUERY PLAN {query}"
df = pd.read_sql_query(explain_query, conn)
print("Query Execution Plan:")
print(df)
print("\nActual Query Results:")
return pd.read_sql_query(query, conn)
# Example: Comparing join strategies
query1 = """
SELECT c.*, o.*
FROM customers c, orders o
WHERE c.customer_id = o.customer_id
AND c.country = 'USA'
"""
query2 = """
SELECT c.*, o.*
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE c.country = 'USA'
"""
print("Query 1 - Old style join:")
df1 = analyze_query(query1)
print("\nQuery 2 - Modern join:")
df2 = analyze_query(query2)
Index Usage
# Create indexes for better performance
cursor.executescript('''
-- Create indexes for foreign keys
CREATE INDEX IF NOT EXISTS idx_orders_customer_id
ON orders(customer_id);
CREATE INDEX IF NOT EXISTS idx_order_details_order_id
ON order_details(order_id);
CREATE INDEX IF NOT EXISTS idx_order_details_product_id
ON order_details(product_id);
-- Create composite indexes for common queries
CREATE INDEX IF NOT EXISTS idx_orders_status_date
ON orders(status, order_date);
CREATE INDEX IF NOT EXISTS idx_products_category_price
ON products(category, price);
-- Create covering index
CREATE INDEX IF NOT EXISTS idx_customers_covering
ON customers(customer_id, first_name, last_name, email);
''')
# Query using indexes
query = """
SELECT
c.first_name,
c.last_name,
c.email,
COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status = 'Delivered'
AND o.order_date >= DATE('now', '-30 days')
GROUP BY c.customer_id, c.first_name, c.last_name, c.email
HAVING COUNT(o.order_id) > 0
ORDER BY order_count DESC;
"""
df = analyze_query(query)
print(df.head())
Real-world Complex Query Examples
Customer Lifetime Value Analysis
query = """
WITH customer_metrics AS (
SELECT
c.customer_id,
c.first_name || ' ' || c.last_name AS customer_name,
c.customer_segment,
c.registration_date,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.total_amount) AS total_spent,
AVG(o.total_amount) AS avg_order_value,
MIN(o.order_date) AS first_order_date,
MAX(o.order_date) AS last_order_date,
JULIANDAY(MAX(o.order_date)) - JULIANDAY(MIN(o.order_date)) AS customer_lifespan_days,
JULIANDAY('now') - JULIANDAY(MAX(o.order_date)) AS days_since_last_order
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.status != 'Cancelled'
GROUP BY c.customer_id, c.first_name, c.last_name, c.customer_segment, c.registration_date
),
customer_segments AS (
SELECT
*,
CASE
WHEN days_since_last_order <= 30 THEN 'Active'
WHEN days_since_last_order <= 90 THEN 'At Risk'
WHEN days_since_last_order <= 180 THEN 'Dormant'
ELSE 'Lost'
END AS activity_status,
CASE
WHEN total_spent > 10000 THEN 'High Value'
WHEN total_spent > 5000 THEN 'Medium Value'
WHEN total_spent > 1000 THEN 'Low Value'
ELSE 'Minimal Value'
END AS value_segment,
CASE
WHEN customer_lifespan_days = 0 THEN 0
ELSE total_spent / (customer_lifespan_days / 365.0)
END AS annual_value
FROM customer_metrics
)
SELECT
activity_status,
value_segment,
COUNT(*) AS customer_count,
AVG(total_orders) AS avg_orders,
AVG(total_spent) AS avg_lifetime_value,
AVG(annual_value) AS avg_annual_value,
SUM(total_spent) AS total_revenue
FROM customer_segments
GROUP BY activity_status, value_segment
ORDER BY activity_status, total_revenue DESC;
"""
df = pd.read_sql_query(query, conn)
print(df)
Market Basket Analysis
query = """
WITH product_pairs AS (
SELECT
od1.product_id AS product_1,
od2.product_id AS product_2,
COUNT(DISTINCT od1.order_id) AS times_bought_together
FROM order_details od1
INNER JOIN order_details od2
ON od1.order_id = od2.order_id
AND od1.product_id < od2.product_id
GROUP BY od1.product_id, od2.product_id
HAVING COUNT(DISTINCT od1.order_id) >= 5
),
product_stats AS (
SELECT
product_id,
COUNT(DISTINCT order_id) AS total_orders
FROM order_details
GROUP BY product_id
)
SELECT
p1.product_name AS product_1_name,
p2.product_name AS product_2_name,
pp.times_bought_together,
ps1.total_orders AS product_1_orders,
ps2.total_orders AS product_2_orders,
ROUND(
CAST(pp.times_bought_together AS FLOAT) / ps1.total_orders * 100,
2
) AS confidence_1_to_2,
ROUND(
CAST(pp.times_bought_together AS FLOAT) / ps2.total_orders * 100,
2
) AS confidence_2_to_1,
ROUND(
CAST(pp.times_bought_together AS FLOAT) /
(ps1.total_orders + ps2.total_orders - pp.times_bought_together) * 100,
2
) AS lift
FROM product_pairs pp
JOIN products p1 ON pp.product_1 = p1.product_id
JOIN products p2 ON pp.product_2 = p2.product_id
JOIN product_stats ps1 ON pp.product_1 = ps1.product_id
JOIN product_stats ps2 ON pp.product_2 = ps2.product_id
ORDER BY pp.times_bought_together DESC
LIMIT 20;
"""
df = pd.read_sql_query(query, conn)
print(df)
Practice Exercises
Exercise 1: Sales Performance Dashboard
Create a comprehensive query that shows:
- Monthly sales trends with year-over-year comparison
- Top performing products by category
- Customer segment analysis
- Geographic sales distribution
- All in a single query using CTEs
Exercise 2: Customer Churn Analysis
Build a query to identify:
- Customers likely to churn (no orders in 90 days)
- Their historical purchase patterns
- Common characteristics of churned customers
- Recommendations for retention
Exercise 3: Inventory Optimization
Design a query that:
- Identifies slow-moving products
- Calculates optimal reorder points
- Suggests bundle opportunities
- Forecasts demand based on historical data
Key Takeaways
- 🔗 Master different join types for various data relationships
- 📊 CTEs make complex queries readable and maintainable
- 🎯 Subqueries provide flexibility in data filtering and aggregation
- ⚡ Proper indexing dramatically improves query performance
- 🔄 Recursive CTEs handle hierarchical data elegantly
- 📈 Set operations combine results from multiple queries
- 🧩 Complex business logic can be expressed in SQL