BeautifulSoup for HTML Parsing
Extract Data from Any Website! 🕷️
Web scraping gives you access to data that isn't published through APIs or downloadable datasets. BeautifulSoup parses HTML and XML documents into a tree that you can navigate and search, and paired with an HTTP client such as requests it lets you extract the data you need. From e-commerce prices to news articles, master the art of web scraping to gather data from across the internet!
Understanding Web Scraping
graph TD
A[Web Page] --> B[HTTP Request]
B --> C[HTML Response]
C --> D[Parse HTML]
D --> E[BeautifulSoup]
E --> F[Navigate DOM]
F --> G[Extract Data]
G --> H[Clean & Store]
E --> I[find/find_all]
E --> J[CSS Selectors]
E --> K[Navigation]
I --> L[By Tag]
I --> M[By Class]
I --> N[By ID]
I --> O[By Attributes]
Installation and Setup
# Install required libraries
"""
pip install beautifulsoup4
pip install requests
pip install lxml # Faster parser
pip install html5lib # More lenient parser
pip install pandas # For data storage
"""
# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from urllib.parse import urljoin, urlparse
import time
import json
# Check versions
import bs4
print(f"BeautifulSoup version: {bs4.__version__}")
# Basic setup - create your first soup
html_doc = """
<html>
<head><title>Sample Page</title></head>
<body>
<h1>Welcome to Web Scraping</h1>
<p class="intro">This is a sample paragraph.</p>
<div id="content">
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
</div>
</body>
</html>
"""
# Create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')
# Pretty print
print(soup.prettify())
BeautifulSoup Basics
Parsers Comparison
# Different parsers and their characteristics
# 1. html.parser (built-in)
soup = BeautifulSoup(html_doc, 'html.parser')
# Pros: No extra dependencies, decent speed
# Cons: Less lenient with broken HTML
# 2. lxml's HTML parser (recommended)
soup = BeautifulSoup(html_doc, 'lxml')
# Pros: Very fast, lenient
# Cons: Requires C dependency
# 3. lxml's XML parser
soup = BeautifulSoup(html_doc, 'xml')
# Pros: Only currently supported XML parser
# Cons: XML only
# 4. html5lib
soup = BeautifulSoup(html_doc, 'html5lib')
# Pros: Most lenient, creates valid HTML5
# Cons: Slowest
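# Performance comparison
import time
from bs4 import FeatureNotFound
parsers = ['html.parser', 'lxml', 'html5lib']
html_large = html_doc * 1000 # Large HTML for testing
for parser in parsers:
    try:
        # perf_counter is a monotonic clock intended for timing short intervals
        start = time.perf_counter()
        soup = BeautifulSoup(html_large, parser)
        end = time.perf_counter()
        print(f"{parser}: {end - start:.4f} seconds")
    except FeatureNotFound:
        # Raised when the requested parser library is not installed
        print(f"{parser}: Not available")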
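The leniency differences listed above show up most clearly on malformed markup. The following is a minimal sketch, assuming only beautifulsoup4 is guaranteed to be installed; the helper name parse_with_each and the sample fragment are illustrative, and parsers that are not installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def parse_with_each(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: repaired_markup} for every parser that is installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The parser library is not installed; leave it out of the results
            continue
    return results

# An opening <a> followed by a stray closing </p>: each parser repairs it differently
for name, repaired in parse_with_each("<a></p>").items():
    print(f"{name:12} -> {repaired}")
```

The built-in html.parser simply drops the stray closing tag, while lxml and html5lib also wrap the fragment in a full document skeleton, so the same selector can behave differently depending on the parser you choose.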
Navigating the Parse Tree
# Sample HTML for demonstration
html = """
<div class="container">
<header>
<h1>Product Store</h1>
<nav>
<a href="/home">Home</a>
<a href="/products">Products</a>
<a href="/about">About</a>
</nav>
</header>
<main>
<div class="product" data-id="1">
<h2>Laptop</h2>
<p class="price">$999.99</p>
<p class="description">High-performance laptop</p>
<span class="rating">4.5</span>
</div>
<div class="product" data-id="2">
<h2>Phone</h2>
<p class="price">$699.99</p>
<p class="description">Latest smartphone</p>
<span class="rating">4.8</span>
</div>
</main>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Accessing tags directly
print(soup.title) # None (no title in this HTML)
print(soup.h1) # First h1 tag
print(soup.h1.string) # Text content
# Navigating using tag names
print(soup.div) # First div
print(soup.div.h2) # First h2 inside first div
# Access tag attributes
product = soup.find('div', class_='product')
print(product['data-id']) # Access attribute
print(product.get('data-id')) # Safe access
print(product.attrs) # All attributes
# Navigate children
container = soup.find('div', class_='container')
for child in container.children:
    if child.name: # Skip text nodes
        print(f"Child tag: {child.name}")
# Navigate descendants (all levels)
for desc in container.descendants:
    if desc.name:
        print(f"Descendant: {desc.name}")
# Navigate parents
price = soup.find('p', class_='price')
print(price.parent.name) # div
print(price.find_parent('main').name) # main
# Navigate siblings
first_product = soup.find('div', class_='product')
print(first_product.find_next_sibling('div')['data-id']) # 2
print(first_product.find_previous_sibling()) # None
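One detail worth knowing when reading text from tags: .string only returns a value when a tag has a single child, while get_text() collects text from all descendants. A small sketch, using a made-up snippet modelled on the product markup above:

```python
from bs4 import BeautifulSoup

snippet = """
<div class="product">
  <h2>Laptop</h2>
  <p class="price">$999.99</p>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

h2 = soup.find("h2")
product = soup.find("div", class_="product")

# .string works when the tag has exactly one child
print(h2.string)                            # Laptop

# .string is None when the tag has several children
print(product.string)                       # None

# get_text() joins the text of all descendants; strip=True drops blank strings
print(product.get_text(" ", strip=True))    # Laptop $999.99
```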
Finding Elements
find() and find_all()
# find() - returns first match
# find_all() - returns list of all matches
# Find by tag
soup.find('h2') # First h2
soup.find_all('p') # All p tags
# Find by class
soup.find('div', class_='product') # Note: class_ with underscore
soup.find_all('p', class_='price')
# Find by ID
soup.find('div', id='content') # None for this soup: 'content' exists only in the first sample document
# Find by attributes
soup.find('div', attrs={'data-id': '1'})
soup.find_all('div', attrs={'class': 'product'})
# Find by multiple criteria
soup.find('p', class_='price', string=re.compile(r'\$\d+'))
# Find with custom function
def has_data_attribute(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attribute)
# Limit results
soup.find_all('div', limit=2)
# Find by text content
soup.find_all(string="Laptop") # Exact match
soup.find_all(string=re.compile("laptop", re.IGNORECASE)) # Regex
# Find parent/children with specific criteria
product = soup.find('div', class_='product')
product.find_all('p') # Search within element
# Practical examples
def extract_products(soup):
    products = []
    for product in soup.find_all('div', class_='product'):
        data = {
            'id': product.get('data-id'),
            'name': product.find('h2').text.strip(),
            'price': product.find('p', class_='price').text.strip(),
            'description': product.find('p', class_='description').text.strip(),
            'rating': float(product.find('span', class_='rating').text.strip())
        }
        products.append(data)
    return products

products_data = extract_products(soup)
df = pd.DataFrame(products_data)
print(df)
CSS Selectors
# CSS selectors provide powerful and familiar syntax
# Basic selectors
soup.select('p') # All p tags
soup.select('.price') # Class selector
soup.select('#content') # ID selector
soup.select('div.product') # Tag with class
# Descendant selectors
soup.select('div p') # p inside div (any level)
soup.select('div > p') # Direct children only
soup.select('main .product') # class product inside main
# Multiple selectors
soup.select('p.price, p.description') # Either selector
# Attribute selectors
soup.select('[data-id]') # Has attribute
soup.select('[data-id="1"]') # Specific value
soup.select('[class*="pro"]') # Contains string
soup.select('[href^="http"]') # Starts with
soup.select('[href$=".pdf"]') # Ends with
# Pseudo-selectors
soup.select('div:first-child')
soup.select('p:nth-of-type(2)')
soup.select('li:last-child')
# Complex example
# Find all prices in products with rating > 4.5
products = soup.select('.product')
high_rated = []
for product in products:
    rating = float(product.select_one('.rating').text)
    if rating > 4.5:
        price = product.select_one('.price').text
        high_rated.append(price)
print(f"Prices of high-rated products: {high_rated}")
# Same result as a list comprehension
high_rated_prices = [
    p.select_one('.price').text
    for p in soup.select('.product')
    if float(p.select_one('.rating').text) > 4.5
]
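Two behaviours are easy to trip over with class-based lookups: an element can carry several classes, and select_one returns None when nothing matches. A short sketch on a made-up snippet (the "featured" class is an assumption added for illustration):

```python
from bs4 import BeautifulSoup

snippet = """
<div class="product featured" data-id="1"><h2>Laptop</h2></div>
<div class="product" data-id="2"><h2>Phone</h2></div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# class_ matches any one of the element's classes
all_products = soup.find_all("div", class_="product")
print([d["data-id"] for d in all_products])        # ['1', '2']

# Chained class selectors require every listed class
featured = soup.select("div.product.featured")
print([d["data-id"] for d in featured])            # ['1']

# The class attribute is exposed as a list, not a string
print(featured[0]["class"])                        # ['product', 'featured']

# select_one returns None when nothing matches, so guard before using .text
missing = soup.select_one(".rating")
print(missing.text if missing else "no rating")    # no rating
```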
Real Website Scraping
Scraping a News Website
# Example: Scraping news articles (use responsibly!)
def scrape_news_site(url):
    """Scrape news articles from a website"""
    # Send a browser-like User-Agent; some sites reject the default requests one
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    try:
        # Make request (the timeout keeps a stalled server from hanging the script)
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status() # Raise exception for bad status codes
        # Parse HTML
        soup = BeautifulSoup(response.content, 'lxml')
        # Remove script and style elements
        for script in soup(['script', 'style']):
            script.decompose()
        articles = []
        # Example: Find article containers (adjust selectors for specific site)
        article_containers = soup.find_all('article') or soup.find_all('div', class_='article')
        for article in article_containers:
            try:
                summary = article.find('p')
                link = article.find('a')
                date = article.find('time')
                author = article.find(class_='author')
                # Extract data (adjust based on site structure)
                data = {
                    'title': article.find(['h1', 'h2', 'h3']).text.strip(),
                    'summary': summary.text.strip() if summary else '',
                    # .get() avoids a KeyError when the attribute is missing
                    'link': link.get('href', '') if link else '',
                    'date': date.get('datetime', '') if date else '',
                    'author': author.text.strip() if author else ''
                }
                # Clean data
                data = {k: v.replace('\n', ' ').strip() for k, v in data.items()}
                articles.append(data)
            except AttributeError:
                continue # Skip if structure doesn't match (e.g. no heading)
        return articles
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []
# Example usage (respect robots.txt!)
# articles = scrape_news_site('https://example-news.com')
# df = pd.DataFrame(articles)
# df.to_csv('news_articles.csv', index=False)
E-commerce Product Scraping
# Scraping product information
def scrape_products(url, max_pages=5):
"""Scrape products from e-commerce site"""
all_products = []
session = requests.Session()
session.headers.update({
'User-Agent': 'Mozilla/5.0 (compatible; DataScienceBot/1.0)'
})
for page in range(1, max_pages + 1):
# Add pagination parameter
params = {'page': page}
try:
response = session.get(url, params=params)
if response.status_code != 200:
break
soup = BeautifulSoup(response.content, 'lxml')
# Find product containers (example selectors)
products = soup.select('.product-item, .product-card, [data-testid="product"]')
if not products:
print(f"No products found on page {page}")
break
for product in products:
try:
# Extract product details
name = product.select_one('.product-name, .title, h3, h4')
price = product.select_one('.price, .product-price, [data-testid="price"]')
rating = product.select_one('.rating, .stars, [aria-label*="rating"]')
image = product.select_one('img')
link = product.select_one('a')
# Extract and clean data
product_data = {
'name': name.text.strip() if name else 'N/A',
'price': extract_price(price.text) if price else None,
'rating': extract_rating(rating) if rating else None,
'image_url': image.get('src', '') if image else '',
'product_url': urljoin(url, link.get('href', '')) if link else '',
'scraped_at': pd.Timestamp.now()
}
all_products.append(product_data)
except Exception as e:
print(f"Error parsing product: {e}")
continue
print(f"Scraped page {page}: {len(products)} products")
# Be respectful - add delay between requests
time.sleep(1)
except Exception as e:
print(f"Error on page {page}: {e}")
break
return all_products
def extract_price(price_text):
    """Extract numerical price from text"""
    # Drop thousands separators, then take the first number
    price = re.findall(r'\d+(?:\.\d+)?', price_text.replace(',', ''))
    return float(price[0]) if price else None

def extract_rating(rating_element):
    """Extract rating from various formats"""
    if not rating_element:
        return None
    # Try different extraction methods
    # Method 1: aria-label
    if rating_element.get('aria-label'):
        rating = re.findall(r'\d+(?:\.\d+)?', rating_element['aria-label'])
        return float(rating[0]) if rating else None
    # Method 2: Text content
    # The pattern requires at least one digit, so a lone "." is never passed to float()
    rating = re.findall(r'\d+(?:\.\d+)?', rating_element.text)
    return float(rating[0]) if rating else None
# Usage example
# products = scrape_products('https://example-shop.com/products')
# df = pd.DataFrame(products)
# print(df.head())
# df.to_csv('products.csv', index=False)
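To make the price-cleaning step concrete, here is a small standalone sketch. The helper name parse_price is hypothetical, and it assumes US-style formatting (comma as thousands separator, dot as decimal point). It returns a Decimal rather than a float, which avoids binary rounding when prices are later summed or compared.

```python
import re
from decimal import Decimal

# A digit, then more digits or thousands commas, then an optional decimal part
_PRICE_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def parse_price(text):
    """Return the first US-style price found in text as a Decimal, or None."""
    match = _PRICE_RE.search(text or "")
    if not match:
        return None
    return Decimal(match.group().replace(",", ""))

print(parse_price("$1,299.99"))       # 1299.99
print(parse_price("Now only $699"))   # 699
print(parse_price("Out of stock"))    # None
```

European-style prices such as "1.299,00" would need a different rule, so treat the pattern as a starting point to adapt per site.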
Handling Dynamic Content
JavaScript-Rendered Content
# Sometimes content is loaded by JavaScript after page load
# BeautifulSoup alone won't see this content
# Method 1: Find data embedded in the page's scripts
def find_api_data(url):
    """Look for JSON state embedded in the page's <script> tags"""
    # First, get the initial HTML
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'lxml')
    # Look for JavaScript variables containing data
    scripts = soup.find_all('script')
    for script in scripts:
        if script.string and 'window.__INITIAL_STATE__' in script.string:
            # Extract JSON data. The non-greedy match stops at the first "};",
            # so it can truncate data that contains that sequence inside a string.
            json_text = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*?\});',
                                  script.string, re.DOTALL)
            if json_text:
                try:
                    return json.loads(json_text.group(1))
                except json.JSONDecodeError:
                    continue # Not valid JSON; try the next script
    return None
# Method 2: Use requests-html (renders JavaScript)
"""
from requests_html import HTMLSession
session = HTMLSession()
response = session.get(url)
response.html.render() # This will execute JavaScript
soup = BeautifulSoup(response.html.html, 'lxml')
"""
# Method 3: Selenium (covered in separate lesson)
# For complex JavaScript sites, use Selenium
# Method 4: Find XHR/Fetch requests
def find_ajax_endpoints(url):
"""
Use browser developer tools to find AJAX endpoints:
1. Open Network tab
2. Filter by XHR or Fetch
3. Look for JSON responses
4. Use those endpoints directly
"""
pass
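For Method 1, a regex that stops at the first "};" can cut the data short when that sequence appears inside a string value. One possible alternative, sketched here with the standard library only, is to let json.JSONDecoder.raw_decode find where the JSON value ends. The helper name extract_embedded_json and the sample script text are assumptions for illustration, and the approach only works when the embedded value is strict JSON rather than a JavaScript object literal.

```python
import json
import re

def extract_embedded_json(script_text, variable="window.__INITIAL_STATE__"):
    """Return the JSON value assigned to `variable` in script_text, or None."""
    # Locate "<variable> =" and consume whitespace up to the start of the value
    match = re.search(re.escape(variable) + r"\s*=\s*", script_text)
    if not match:
        return None
    try:
        # raw_decode parses one complete JSON value and ignores whatever follows it
        data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    except json.JSONDecodeError:
        return None
    return data

# The string value deliberately contains "};" to show that it is not cut short
script = 'window.__INITIAL_STATE__ = {"note": "a}; b", "items": [1, 2]}; init();'
print(extract_embedded_json(script))
# {'note': 'a}; b', 'items': [1, 2]}
```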
Data Extraction Patterns
Tables to DataFrames
# Extract HTML tables directly to pandas
html_table = """
<table id="data-table">
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>Alice</td>
<td>30</td>
<td>New York</td>
</tr>
<tr>
<td>Bob</td>
<td>25</td>
<td>Los Angeles</td>
</tr>
</tbody>
</table>
"""
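# Method 1: Using pandas read_html
# Wrap literal HTML in StringIO; passing a raw HTML string is deprecated since pandas 2.1
from io import StringIO
dfs = pd.read_html(StringIO(html_table))
df = dfs[0] # First table
print(df)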
# Method 2: Manual extraction with BeautifulSoup
soup = BeautifulSoup(html_table, 'lxml')
table = soup.find('table')
# Extract headers
headers = [th.text.strip() for th in table.find_all('th')]
# Extract rows
rows = []
for tr in table.find('tbody').find_all('tr'):
row = [td.text.strip() for td in tr.find_all('td')]
rows.append(row)
# Create DataFrame
df = pd.DataFrame(rows, columns=headers)
print(df)
# Method 3: More complex table with attributes
def extract_complex_table(soup, table_id):
"""Extract table with additional attributes"""
table = soup.find('table', id=table_id)
if not table:
return None
# Extract headers (note: colspan/rowspan are not expanded here)
headers = []
header_rows = table.find('thead').find_all('tr') if table.find('thead') else []
for row in header_rows:
headers.extend([th.text.strip() for th in row.find_all(['th', 'td'])])
# Extract data
data = []
body = table.find('tbody') or table
for row in body.find_all('tr'):
if row.find('th'): # Skip header rows in tbody
continue
row_data = {}
cells = row.find_all(['td', 'th'])
for i, cell in enumerate(cells):
if i < len(headers):
# Handle various data types
value = cell.text.strip()
# Check for links
if cell.find('a'):
row_data[f"{headers[i]}_link"] = cell.find('a').get('href')
# Check for images
if cell.find('img'):
row_data[f"{headers[i]}_img"] = cell.find('img').get('src')
row_data[headers[i]] = value
if row_data:
data.append(row_data)
return pd.DataFrame(data)
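The extract_complex_table helper reads one header per cell, so a header that spans several columns shifts the column mapping. A minimal sketch of one way to expand colspan follows; the helper name expand_row and the sample table are hypothetical, and rowspan is not handled.

```python
from bs4 import BeautifulSoup

def expand_row(row):
    """Return the row's cell texts, repeating each cell colspan times."""
    values = []
    for cell in row.find_all(["th", "td"]):
        text = cell.get_text(strip=True)
        # A missing or non-numeric colspan is treated as 1
        span = cell.get("colspan", "1")
        repeat = int(span) if span.isdigit() else 1
        values.extend([text] * max(repeat, 1))
    return values

spanned_table = """
<table>
  <tr><th>Name</th><th colspan="2">Location</th></tr>
  <tr><td>Alice</td><td>New York</td><td>USA</td></tr>
</table>
"""
table_soup = BeautifulSoup(spanned_table, "html.parser")
header_row, first_row = table_soup.find_all("tr")
print(expand_row(header_row))   # ['Name', 'Location', 'Location']
print(expand_row(first_row))    # ['Alice', 'New York', 'USA']
```

After expansion the header list has the same length as each data row, so the two can be zipped together safely.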
Pagination Handling
# Scraping multiple pages
def scrape_with_pagination(base_url, max_pages=None):
"""Handle different pagination styles"""
all_data = []
page = 1
session = requests.Session()
while True:
# Construct URL (adjust based on site's pagination style)
# Style 1: Query parameter
url = f"{base_url}?page={page}"
# Style 2: Path-based
# url = f"{base_url}/page/{page}"
# Style 3: Offset-based
# url = f"{base_url}?offset={(page-1)*20}"
try:
response = session.get(url)
if response.status_code != 200:
break
soup = BeautifulSoup(response.content, 'lxml')
# Extract data from current page
items = soup.select('.item') # Adjust selector
if not items:
print(f"No items found on page {page}")
break
for item in items:
# Extract item data
data = extract_item_data(item)
all_data.append(data)
print(f"Scraped page {page}: {len(items)} items")
# Check for next page
next_link = soup.select_one('.pagination .next, [rel="next"], .next-page')
if not next_link or next_link.get('disabled') or next_link.get('aria-disabled') == 'true':
print("No more pages")
break
# Respect rate limits
time.sleep(1)
page += 1
if max_pages and page > max_pages:
break
except Exception as e:
print(f"Error on page {page}: {e}")
break
return all_data
def extract_item_data(item_element):
"""Extract data from individual item"""
return {
'title': item_element.select_one('.title').text.strip(),
'price': item_element.select_one('.price').text.strip(),
# Add more fields as needed
}
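Instead of guessing the next page number, a scraper can follow the href of the "next" link that the pagination code above already looks up. Those hrefs are often relative, so they need to be resolved against the current page URL. A small sketch using urljoin from the standard library; the helper name resolve_next_url and the example URLs are assumptions:

```python
from urllib.parse import urljoin

def resolve_next_url(current_url, next_href):
    """Return the absolute URL of the next page, or None if there is none."""
    if not next_href or next_href.startswith("#"):
        # A missing href or a fragment-only placeholder means no real next page
        return None
    return urljoin(current_url, next_href)

print(resolve_next_url("https://example.com/products?page=1", "?page=2"))
# https://example.com/products?page=2
print(resolve_next_url("https://example.com/shop/list/", "page/2"))
# https://example.com/shop/list/page/2
print(resolve_next_url("https://example.com/shop/list", "/shop/list?offset=20"))
# https://example.com/shop/list?offset=20
print(resolve_next_url("https://example.com/products?page=9", "#"))
# None
```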
Advanced Techniques
Handling Forms and Sessions
# Login and maintain session
def login_and_scrape(login_url, username, password, target_url):
    """Login to website and scrape protected content"""
    session = requests.Session()
    # Get login page
    login_page = session.get(login_url)
    soup = BeautifulSoup(login_page.content, 'lxml')
    # Find CSRF token (if present; the field name varies by site)
    csrf_token = soup.find('input', {'name': 'csrf_token'})
    # Prepare login data
    login_data = {
        'username': username,
        'password': password
    }
    if csrf_token:
        login_data['csrf_token'] = csrf_token.get('value')
    # Submit login form
    response = session.post(login_url, data=login_data)
    # A 200 status only means the request succeeded. Many sites return 200
    # with an error page for bad credentials, so also inspect the target page.
    if response.status_code == 200:
        print("Login request accepted")
        # Now access protected content
        protected_page = session.get(target_url)
        soup = BeautifulSoup(protected_page.content, 'lxml')
        # Extract protected data
        return soup
    else:
        print("Login failed")
        return None
# Handle cookies
def scrape_with_cookies():
"""Use cookies for authentication"""
cookies = {
'session_id': 'abc123',
'auth_token': 'xyz789'
}
response = requests.get('https://example.com/protected', cookies=cookies)
soup = BeautifulSoup(response.content, 'lxml')
return soup
Error Handling and Robustness
# Robust scraping with error handling
import logging
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry # Import from urllib3 directly; requests.packages is a legacy alias
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class RobustScraper:
    def __init__(self, base_url, max_retries=3):
        self.base_url = base_url
        self.session = self._create_session(max_retries)
        self.scraped_urls = set()
        self.failed_urls = set() # URLs that errored, so they are not re-queued forever

    def _create_session(self, max_retries):
        """Create session with retry strategy"""
        session = requests.Session()
        retry_strategy = Retry(
            total=max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            # 'allowed_methods' replaced 'method_whitelist', which urllib3 2.0 removed
            allowed_methods=["HEAD", "GET", "OPTIONS"],
            backoff_factor=1
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        # Set headers
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; PythonScraper/1.0)'
        })
        return session

    def scrape_url(self, url):
        """Scrape a single URL with error handling"""
        # Check if already scraped
        if url in self.scraped_urls:
            logger.info(f"Already scraped: {url}")
            return None
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            self.scraped_urls.add(url)
            soup = BeautifulSoup(response.content, 'lxml')
            return soup
        except requests.exceptions.Timeout:
            logger.error(f"Timeout error for {url}")
        except requests.exceptions.ConnectionError:
            logger.error(f"Connection error for {url}")
        except requests.exceptions.HTTPError as e:
            logger.error(f"HTTP error {e.response.status_code} for {url}")
        except Exception as e:
            logger.error(f"Unexpected error for {url}: {e}")
        # Reached only after an error: remember the URL so the crawler skips it
        self.failed_urls.add(url)
        return None

    def extract_links(self, soup, pattern=None):
        """Extract all links from page"""
        links = set()
        for link in soup.find_all('a', href=True):
            url = urljoin(self.base_url, link['href'])
            # Filter links based on pattern
            if pattern and not re.match(pattern, url):
                continue
            links.add(url)
        return links

    def scrape_site(self, start_url, max_pages=100):
        """Scrape entire site with crawling"""
        to_scrape = {start_url}
        scraped_data = []
        while to_scrape and len(self.scraped_urls) < max_pages:
            url = to_scrape.pop()
            soup = self.scrape_url(url)
            if not soup:
                continue
            # Extract data from page
            data = self.extract_page_data(soup, url)
            if data:
                scraped_data.append(data)
            # Find more links, following only those under base_url
            new_links = self.extract_links(soup, pattern=re.escape(self.base_url))
            to_scrape.update(new_links - self.scraped_urls - self.failed_urls)
            # Rate limiting
            time.sleep(1)
        return scraped_data

    def extract_page_data(self, soup, url):
        """Override this method for specific data extraction"""
        return {
            'url': url,
            'title': soup.find('title').text if soup.find('title') else '',
            'h1': soup.find('h1').text if soup.find('h1') else '',
            'paragraphs': len(soup.find_all('p')),
            'links': len(soup.find_all('a'))
        }
Best Practices
Ethical Scraping Guidelines
# Always follow ethical scraping practices
def check_robots_txt(domain):
"""Check robots.txt before scraping"""
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url(f"{domain}/robots.txt")
rp.read()
# Check if URL is allowed
can_fetch = rp.can_fetch("*", f"{domain}/page")
crawl_delay = rp.crawl_delay("*")
return can_fetch, crawl_delay
# Respectful scraping class
class EthicalScraper:
def __init__(self, domain, user_agent="PythonBot/1.0"):
self.domain = domain
self.user_agent = user_agent
self.session = requests.Session()
self.session.headers.update({'User-Agent': user_agent})
# Check robots.txt
self.can_scrape, self.delay = check_robots_txt(domain)
if not self.can_scrape:
raise Exception(f"Scraping not allowed for {domain}")
# Default delay if not specified
self.delay = self.delay or 1
def scrape(self, url):
"""Scrape with rate limiting"""
if not self.can_scrape:
return None
response = self.session.get(url)
# Respect rate limits
time.sleep(self.delay)
return BeautifulSoup(response.content, 'lxml')
# Best practices checklist
scraping_best_practices = """
1. ALWAYS check robots.txt
2. Identify yourself with proper User-Agent
3. Respect rate limits and add delays
4. Cache responses to avoid re-scraping
5. Handle errors gracefully
6. Don't overload servers
7. Check website's terms of service
8. Use APIs when available
9. Be prepared to stop if asked
10. Store data responsibly
"""
print(scraping_best_practices)
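To see how robots.txt rules are evaluated without making a network request, the standard library parser can be fed the file's lines directly. This is a small sketch with made-up rules and a hypothetical bot name; a real scraper would load the site's actual robots.txt as check_robots_txt does above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory instead of fetched
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(robots_lines)

user_agent = "PythonBot/1.0"
print(rp.can_fetch(user_agent, "https://example.com/products"))       # True
print(rp.can_fetch(user_agent, "https://example.com/private/data"))   # False
print(rp.crawl_delay(user_agent))                                     # 2
```

Passing the scraper's own User-Agent rather than "*" matters when a site defines stricter rules for specific bots.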
Practice Exercises
Exercise 1: Build a Price Monitor
Create a price monitoring system that:
- Scrapes product prices from multiple e-commerce sites
- Stores historical price data
- Sends alerts when prices drop
- Generates price trend visualizations
- Handles different site structures
Exercise 2: News Aggregator
Build a news aggregation system:
- Scrape articles from multiple news sources
- Extract title, author, date, and content
- Remove duplicate articles
- Categorize articles by topic
- Create a searchable database
Exercise 3: Job Market Analyzer
Develop a job market analysis tool:
- Scrape job listings from job boards
- Extract skills, salary, and requirements
- Analyze trending skills
- Map the geographic distribution of jobs
- Generate market insights report
Key Takeaways
- 🕷️ BeautifulSoup makes HTML parsing intuitive and Pythonic
- 🔍 Multiple ways to find elements: tags, classes, IDs, CSS selectors
- 🌳 Navigate HTML trees with parent, children, and sibling methods
- ⚡ Choose the right parser for speed vs. leniency trade-off
- 🔄 Handle pagination and dynamic content appropriately
- ⚠️ Always follow ethical scraping practices
- 🛡️ Implement robust error handling for production scraping