BeautifulSoup for HTML Parsing - Python Data Science Path

Extract Data from Any Website! 🕷️

Web scraping opens up a world of data that isn't available through traditional APIs or datasets. BeautifulSoup makes it easy to parse HTML and XML documents, navigate parse trees, and extract the data you need. From e-commerce prices to news articles, master the art of web scraping to gather data from across the internet!

Understanding Web Scraping

graph TD
    A[Web Page] --> B[HTTP Request]
    B --> C[HTML Response]
    C --> D[Parse HTML]
    D --> E[BeautifulSoup]
    E --> F[Navigate DOM]
    F --> G[Extract Data]
    G --> H[Clean & Store]
    E --> I[find/find_all]
    E --> J[CSS Selectors]
    E --> K[Navigation]
    I --> L[By Tag]
    I --> M[By Class]
    I --> N[By ID]
    I --> O[By Attributes]

Installation and Setup

# Install required libraries
"""
pip install beautifulsoup4
pip install requests
pip install lxml  # Faster parser
pip install html5lib  # More lenient parser
pip install pandas  # For data storage
"""

# Import libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re
from urllib.parse import urljoin, urlparse
import time
import json

# Check versions
import bs4
print(f"BeautifulSoup version: {bs4.__version__}")

# Basic setup - create your first soup
html_doc = """
<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Welcome to Web Scraping</h1>
    <p class="intro">This is a sample paragraph.</p>
    <div id="content">
        <ul>
            <li>Item 1</li>
            <li>Item 2</li>
            <li>Item 3</li>
        </ul>
    </div>
</body>
</html>
"""

# Create BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Pretty print
print(soup.prettify())

BeautifulSoup Basics

Parsers Comparison

# Different parsers and their characteristics

# 1. html.parser (built-in)
soup = BeautifulSoup(html_doc, 'html.parser')
# Pros: No extra dependencies, decent speed
# Cons: Less lenient with broken HTML

# 2. lxml's HTML parser (recommended)
soup = BeautifulSoup(html_doc, 'lxml')
# Pros: Very fast, lenient
# Cons: Requires C dependency

# 3. lxml's XML parser
soup = BeautifulSoup(html_doc, 'xml')
# Pros: Only currently supported XML parser
# Cons: XML only

# 4. html5lib
soup = BeautifulSoup(html_doc, 'html5lib')
# Pros: Most lenient, creates valid HTML5
# Cons: Slowest

# Performance comparison
import time

parsers = ['html.parser', 'lxml', 'html5lib']
html_large = html_doc * 1000  # Large HTML for testing

for parser in parsers:
    try:
        start = time.time()
        soup = BeautifulSoup(html_large, parser)
        end = time.time()
        print(f"{parser}: {end - start:.4f} seconds")
    except bs4.FeatureNotFound:  # raised when the parser library is not installed
        print(f"{parser}: Not available")
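
If a parser such as lxml may not be installed everywhere the script runs, a fallback can keep the code working. The sketch below is an illustration, not part of BeautifulSoup itself: `make_soup` and its `preferred` parameter are hypothetical names. It relies on `bs4.FeatureNotFound`, which BeautifulSoup raises when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=('lxml', 'html.parser')):
    """Parse markup with the first installed parser from `preferred` (hypothetical helper)."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise FeatureNotFound(f"None of these parsers are installed: {preferred}")

soup, used_parser = make_soup('<p class="intro">Hello</p>')
print(used_parser, soup.p.get_text())
```

Because `html.parser` ships with Python, keeping it last in the tuple means the helper should always find something to use.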

Navigating the Parse Tree

# Sample HTML for demonstration
html = """
<div class="container">
    <header>
        <h1>Product Store</h1>
        <nav>
            <a href="/home">Home</a>
            <a href="/products">Products</a>
            <a href="/about">About</a>
        </nav>
    </header>
    <main>
        <div class="product" data-id="1">
            <h2>Laptop</h2>
            <p class="price">$999.99</p>
            <p class="description">High-performance laptop</p>
            <span class="rating">4.5</span>
        </div>
        <div class="product" data-id="2">
            <h2>Phone</h2>
            <p class="price">$699.99</p>
            <p class="description">Latest smartphone</p>
            <span class="rating">4.8</span>
        </div>
    </main>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Accessing tags directly
print(soup.title)  # None (no title in this HTML)
print(soup.h1)  # First h1 tag
print(soup.h1.string)  # Text content

# Navigating using tag names
print(soup.div)  # First div
print(soup.div.h2)  # First h2 inside first div

# Access tag attributes
product = soup.find('div', class_='product')
print(product['data-id'])  # Access attribute
print(product.get('data-id'))  # Safe access
print(product.attrs)  # All attributes

# Navigate children
container = soup.find('div', class_='container')
for child in container.children:
    if child.name:  # Skip text nodes
        print(f"Child tag: {child.name}")

# Navigate descendants (all levels)
for desc in container.descendants:
    if desc.name:
        print(f"Descendant: {desc.name}")

# Navigate parents
price = soup.find('p', class_='price')
print(price.parent.name)  # div
print(price.find_parent('main').name)  # main

# Navigate siblings
first_product = soup.find('div', class_='product')
print(first_product.find_next_sibling('div')['data-id'])  # 2
print(first_product.find_previous_sibling())  # None
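
One detail worth checking when reading text: `.string` only returns a value when a tag has exactly one child, while `get_text()` joins all nested text. A minimal sketch, using a made-up snippet rather than the product HTML above:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the price paragraph mixes plain text with a nested <b> tag
snippet = '<h2>Laptop</h2><p class="price">Now <b>$999.99</b></p>'
demo = BeautifulSoup(snippet, 'html.parser')

heading_string = demo.h2.string                # single text child -> 'Laptop'
price_string = demo.p.string                   # two children -> None
price_text = demo.p.get_text(' ', strip=True)  # joins all text -> 'Now $999.99'

print(heading_string, price_string, price_text)
```

When the markup is not under your control, `get_text()` is usually the safer choice.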

Finding Elements

find() and find_all()

# find() - returns first match
# find_all() - returns list of all matches

# Find by tag
soup.find('h2')  # First h2
soup.find_all('p')  # All p tags

# Find by class
soup.find('div', class_='product')  # Note: class_ with underscore
soup.find_all('p', class_='price')

# Find by ID
soup.find('div', id='content')

# Find by attributes
soup.find('div', attrs={'data-id': '1'})
soup.find_all('div', attrs={'class': 'product'})

# Find by multiple criteria
soup.find('p', class_='price', string=re.compile(r'\$\d+'))

# Find with custom function
def has_data_attribute(tag):
    return tag.has_attr('data-id')

soup.find_all(has_data_attribute)

# Limit results
soup.find_all('div', limit=2)

# Find by text content
soup.find_all(string="Laptop")  # Exact match
soup.find_all(string=re.compile("laptop", re.IGNORECASE))  # Regex

# Find parent/children with specific criteria
product = soup.find('div', class_='product')
product.find_all('p')  # Search within element

# Practical examples
def extract_products(soup):
    products = []
    for product in soup.find_all('div', class_='product'):
        data = {
            'id': product.get('data-id'),
            'name': product.find('h2').text.strip(),
            'price': product.find('p', class_='price').text.strip(),
            'description': product.find('p', class_='description').text.strip(),
            'rating': float(product.find('span', class_='rating').text.strip())
        }
        products.append(data)
    return products

products_data = extract_products(soup)
df = pd.DataFrame(products_data)
print(df)
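
`extract_products` assumes every product contains all four fields. `find()` returns `None` for a missing tag, so calling `.text` on the result would raise `AttributeError`. One possible guard is sketched below. `text_or_none` is a hypothetical helper, and the second sample product deliberately lacks a rating to show the fallback.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, name, **filters):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = parent.find(name, **filters)
    return tag.get_text(strip=True) if tag else None

sample = """
<div class="product" data-id="1"><h2>Laptop</h2><span class="rating">4.5</span></div>
<div class="product" data-id="2"><h2>Phone</h2></div>
"""
sample_soup = BeautifulSoup(sample, 'html.parser')

rows = []
for product in sample_soup.find_all('div', class_='product'):
    rating = text_or_none(product, 'span', class_='rating')
    rows.append({
        'id': product.get('data-id'),
        'name': text_or_none(product, 'h2'),
        # Convert only when a rating was actually found
        'rating': float(rating) if rating is not None else None,
    })

print(rows)
```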

CSS Selectors

# CSS selectors provide powerful and familiar syntax

# Basic selectors
soup.select('p')  # All p tags
soup.select('.price')  # Class selector
soup.select('#content')  # ID selector
soup.select('div.product')  # Tag with class

# Descendant selectors
soup.select('div p')  # p inside div (any level)
soup.select('div > p')  # Direct children only
soup.select('main .product')  # class product inside main

# Multiple selectors
soup.select('p.price, p.description')  # Either selector

# Attribute selectors
soup.select('[data-id]')  # Has attribute
soup.select('[data-id="1"]')  # Specific value
soup.select('[class*="pro"]')  # Contains string
soup.select('[href^="http"]')  # Starts with
soup.select('[href$=".pdf"]')  # Ends with

# Pseudo-selectors
soup.select('div:first-child')
soup.select('p:nth-of-type(2)')
soup.select('li:last-child')

# Complex example
# Find all prices in products with rating > 4.5
products = soup.select('.product')
high_rated = []
for product in products:
    rating = float(product.select_one('.rating').text)
    if rating > 4.5:
        price = product.select_one('.price').text
        high_rated.append(price)

print(f"Prices of high-rated products: {high_rated}")

# One-liner using list comprehension
high_rated_prices = [
    p.select_one('.price').text 
    for p in soup.select('.product') 
    if float(p.select_one('.rating').text) > 4.5
]

Real Website Scraping

Scraping a News Website

# Example: Scraping news articles (use responsibly!)

def scrape_news_site(url):
    """Scrape news articles from a website"""
    
    # Set headers to appear as a regular browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
    
    try:
        # Make request
        response = requests.get(url, headers=headers, timeout=10)  # never wait forever
        response.raise_for_status()  # Raise exception for bad status codes
        
        # Parse HTML
        soup = BeautifulSoup(response.content, 'lxml')
        
        # Remove script and style elements
        for script in soup(['script', 'style']):
            script.decompose()
        
        articles = []
        
        # Example: Find article containers (adjust selectors for specific site)
        article_containers = soup.find_all('article') or soup.find_all('div', class_='article')
        
        for article in article_containers:
            try:
                # Extract data (adjust based on site structure)
                data = {
                    'title': article.find(['h1', 'h2', 'h3']).text.strip(),
                    'summary': article.find('p').text.strip() if article.find('p') else '',
                    # .get() avoids a KeyError when the attribute is missing
                    'link': article.find('a').get('href', '') if article.find('a') else '',
                    'date': article.find('time').get('datetime', '') if article.find('time') else '',
                    'author': article.find(class_='author').text.strip() if article.find(class_='author') else ''
                }
                
                # Clean data
                data = {k: v.replace('\n', ' ').strip() for k, v in data.items()}
                articles.append(data)
                
            except AttributeError:
                continue  # Skip if structure doesn't match
        
        return articles
        
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return []

# Example usage (respect robots.txt!)
# articles = scrape_news_site('https://example-news.com')
# df = pd.DataFrame(articles)
# df.to_csv('news_articles.csv', index=False)
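
The `link` values collected above are often relative (for example `/story/1`), so they usually need resolving against the page URL before being stored or followed. A short sketch with `urljoin`, which the setup code already imports. The URLs are placeholders.

```python
from urllib.parse import urljoin

page_url = 'https://example-news.com/world/index.html'  # placeholder page URL

# Root-relative, document-relative, and already-absolute hrefs
absolute_links = [
    urljoin(page_url, href)
    for href in ['/story/1', 'story/2', 'https://other.example/x']
]
print(absolute_links)
```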

E-commerce Product Scraping

# Scraping product information

def scrape_products(url, max_pages=5):
    """Scrape products from e-commerce site"""
    
    all_products = []
    session = requests.Session()
    session.headers.update({
        'User-Agent': 'Mozilla/5.0 (compatible; DataScienceBot/1.0)'
    })
    
    for page in range(1, max_pages + 1):
        # Add pagination parameter
        params = {'page': page}
        
        try:
            response = session.get(url, params=params, timeout=10)
            if response.status_code != 200:
                break
                
            soup = BeautifulSoup(response.content, 'lxml')
            
            # Find product containers (example selectors)
            products = soup.select('.product-item, .product-card, [data-testid="product"]')
            
            if not products:
                print(f"No products found on page {page}")
                break
            
            for product in products:
                try:
                    # Extract product details
                    name = product.select_one('.product-name, .title, h3, h4')
                    price = product.select_one('.price, .product-price, [data-testid="price"]')
                    rating = product.select_one('.rating, .stars, [aria-label*="rating"]')
                    image = product.select_one('img')
                    link = product.select_one('a')
                    
                    # Extract and clean data
                    product_data = {
                        'name': name.text.strip() if name else 'N/A',
                        'price': extract_price(price.text) if price else None,
                        'rating': extract_rating(rating) if rating else None,
                        'image_url': image.get('src', '') if image else '',
                        'product_url': urljoin(url, link.get('href', '')) if link else '',
                        'scraped_at': pd.Timestamp.now()
                    }
                    
                    all_products.append(product_data)
                    
                except Exception as e:
                    print(f"Error parsing product: {e}")
                    continue
            
            print(f"Scraped page {page}: {len(products)} products")
            
            # Be respectful - add delay between requests
            time.sleep(1)
            
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break
    
    return all_products

def extract_price(price_text):
    """Extract numerical price from text"""
    # Drop thousands separators, then pull out the first number
    # (assumes US-style formatting such as "$1,299.99")
    price = re.findall(r'\d+\.?\d*', price_text.replace(',', ''))
    return float(price[0]) if price else None

def extract_rating(rating_element):
    """Extract rating from various formats"""
    if not rating_element:
        return None
    
    # Try different extraction methods
    # Method 1: aria-label
    if rating_element.get('aria-label'):
        rating = re.findall(r'[\d.]+', rating_element['aria-label'])
        return float(rating[0]) if rating else None
    
    # Method 2: Text content
    rating = re.findall(r'[\d.]+', rating_element.text)
    return float(rating[0]) if rating else None

# Usage example
# products = scrape_products('https://example-shop.com/products')
# df = pd.DataFrame(products)
# print(df.head())
# df.to_csv('products.csv', index=False)
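
`extract_price` returns a `float`, which is convenient for analysis but can introduce rounding noise when prices are summed or compared. If exact values matter, one alternative is `decimal.Decimal`. The helper below is a hypothetical variant that, like `extract_price`, assumes US-style formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal

def parse_price(text):
    """Return the first US-style price in text as a Decimal, or None (hypothetical helper)."""
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text)
    if not match:
        return None
    # Remove thousands separators before converting
    return Decimal(match.group().replace(',', ''))

laptop_price = parse_price('$1,299.99')
phone_price = parse_price('Price: 699')
missing_price = parse_price('Sold out')
print(laptop_price, phone_price, missing_price)
```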

Handling Dynamic Content

JavaScript-Rendered Content

# Sometimes content is loaded by JavaScript after page load
# BeautifulSoup alone won't see this content

# Method 1: Find API endpoints
def find_api_data(url):
    """Look for JSON data embedded in the page's <script> tags"""
    
    # First, get the initial HTML
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.content, 'lxml')
    
    # Look for JavaScript variables containing data
    scripts = soup.find_all('script')
    for script in scripts:
        if script.string and 'window.__INITIAL_STATE__' in script.string:
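            # Extract JSON data. The non-greedy match stops at the first "};",
            # so it can truncate state objects that contain that sequence.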
            json_text = re.search(r'window\.__INITIAL_STATE__ = ({.*?});', 
                                 script.string, re.DOTALL)
            if json_text:
                data = json.loads(json_text.group(1))
                return data
    
    return None

# Method 2: Use requests-html (renders JavaScript)
"""
from requests_html import HTMLSession

session = HTMLSession()
response = session.get(url)
response.html.render()  # This will execute JavaScript
soup = BeautifulSoup(response.html.html, 'lxml')
"""

# Method 3: Selenium (covered in separate lesson)
# For complex JavaScript sites, use Selenium

# Method 4: Find XHR/Fetch requests
def find_ajax_endpoints(url):
    """
    Use browser developer tools to find AJAX endpoints:
    1. Open Network tab
    2. Filter by XHR or Fetch
    3. Look for JSON responses
    4. Use those endpoints directly
    """
    pass
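
The regular expression in `find_api_data` can cut the JSON short if the data itself contains `};`. A possible alternative is to let the standard library's JSON decoder find the end of the object with `json.JSONDecoder.raw_decode`. This is a sketch: `extract_embedded_json` is a hypothetical helper, and the marker name is simply the one used in the example above.

```python
import json

def extract_embedded_json(script_text, marker='window.__INITIAL_STATE__'):
    """Decode the first JSON object that follows `marker`, or return None."""
    start = script_text.find(marker)
    if start == -1:
        return None
    brace = script_text.find('{', start)
    if brace == -1:
        return None
    try:
        # raw_decode parses one JSON value starting at `brace` and ignores what follows
        obj, _end = json.JSONDecoder().raw_decode(script_text, brace)
    except json.JSONDecodeError:
        return None
    return obj

# Made-up script content; note the "};" inside a string value
script = 'window.__INITIAL_STATE__ = {"items": [{"name": "a};b"}], "n": 2}; window.other = 1;'
state = extract_embedded_json(script)
print(state)
```

This only works when the embedded value is strict JSON. JavaScript object literals with unquoted keys or trailing commas will still fail to decode and return `None`.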

Data Extraction Patterns

Tables to DataFrames

# Extract HTML tables directly to pandas
html_table = """
<table id="data-table">
    <thead>
        <tr>
            <th>Name</th>
            <th>Age</th>
            <th>City</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Alice</td>
            <td>30</td>
            <td>New York</td>
        </tr>
        <tr>
            <td>Bob</td>
            <td>25</td>
            <td>Los Angeles</td>
        </tr>
    </tbody>
</table>
"""

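# Method 1: Using pandas read_html
# Wrap literal HTML in StringIO; passing a raw string is deprecated since pandas 2.1
from io import StringIO
dfs = pd.read_html(StringIO(html_table))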
df = dfs[0]  # First table
print(df)

# Method 2: Manual extraction with BeautifulSoup
soup = BeautifulSoup(html_table, 'lxml')
table = soup.find('table')

# Extract headers
headers = [th.text.strip() for th in table.find_all('th')]

# Extract rows
rows = []
for tr in table.find('tbody').find_all('tr'):
    row = [td.text.strip() for td in tr.find_all('td')]
    rows.append(row)

# Create DataFrame
df = pd.DataFrame(rows, columns=headers)
print(df)

# Method 3: More complex table with attributes
def extract_complex_table(soup, table_id):
    """Extract table with additional attributes"""
    
    table = soup.find('table', id=table_id)
    if not table:
        return None
    
    # Extract headers (multiple header rows are flattened into one list;
    # colspan/rowspan are not expanded)
    headers = []
    header_rows = table.find('thead').find_all('tr') if table.find('thead') else []
    for row in header_rows:
        headers.extend([th.text.strip() for th in row.find_all(['th', 'td'])])
    
    # Extract data
    data = []
    body = table.find('tbody') or table
    for row in body.find_all('tr'):
        if row.find('th'):  # Skip header rows in tbody
            continue
        
        row_data = {}
        cells = row.find_all(['td', 'th'])
        
        for i, cell in enumerate(cells):
            if i < len(headers):
                # Handle various data types
                value = cell.text.strip()
                
                # Check for links
                if cell.find('a'):
                    row_data[f"{headers[i]}_link"] = cell.find('a').get('href')
                
                # Check for images
                if cell.find('img'):
                    row_data[f"{headers[i]}_img"] = cell.find('img').get('src')
                
                row_data[headers[i]] = value
        
        if row_data:
            data.append(row_data)
    
    return pd.DataFrame(data)
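
`extract_complex_table` reads header cells one by one, so a header with `colspan` yields fewer labels than there are data columns. One way to keep them aligned is to repeat each label across the columns it spans. A minimal sketch: `expand_header_row` is a hypothetical helper, and `rowspan` is not handled.

```python
from bs4 import BeautifulSoup

def expand_header_row(row):
    """Return header labels, repeating each label `colspan` times."""
    labels = []
    for cell in row.find_all(['th', 'td']):
        span = int(cell.get('colspan', 1))  # attribute values are strings
        labels.extend([cell.get_text(strip=True)] * span)
    return labels

header_html = '<table><thead><tr><th colspan="2">Name</th><th>Age</th></tr></thead></table>'
header_row = BeautifulSoup(header_html, 'html.parser').find('tr')
header_labels = expand_header_row(header_row)
print(header_labels)
```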

Pagination Handling

# Scraping multiple pages

def scrape_with_pagination(base_url, max_pages=None):
    """Handle different pagination styles"""
    
    all_data = []
    page = 1
    session = requests.Session()
    
    while True:
        # Construct URL (adjust based on site's pagination style)
        # Style 1: Query parameter
        url = f"{base_url}?page={page}"
        # Style 2: Path-based
        # url = f"{base_url}/page/{page}"
        # Style 3: Offset-based
        # url = f"{base_url}?offset={(page-1)*20}"
        
        try:
            response = session.get(url, timeout=10)
            if response.status_code != 200:
                break
                
            soup = BeautifulSoup(response.content, 'lxml')
            
            # Extract data from current page
            items = soup.select('.item')  # Adjust selector
            
            if not items:
                print(f"No items found on page {page}")
                break
            
            for item in items:
                # Extract item data
                data = extract_item_data(item)
                all_data.append(data)
            
            print(f"Scraped page {page}: {len(items)} items")
            
            # Check for next page
            next_link = soup.select_one('.pagination .next, [rel="next"], .next-page')
            if not next_link or next_link.get('disabled') or next_link.get('aria-disabled') == 'true':
                print("No more pages")
                break
            
            # Respect rate limits
            time.sleep(1)
            
            page += 1
            if max_pages and page > max_pages:
                break
                
        except Exception as e:
            print(f"Error on page {page}: {e}")
            break
    
    return all_data

def extract_item_data(item_element):
    """Extract data from individual item"""
    return {
        'title': item_element.select_one('.title').text.strip(),
        'price': item_element.select_one('.price').text.strip(),
        # Add more fields as needed
    }
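
Instead of guessing the URL pattern, a scraper can follow the `href` of the page's own "next" link, which works regardless of the pagination style. The sketch assumes the link matches `.pagination a.next`. `next_page_url` is a hypothetical helper, and the selector needs adjusting per site.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def next_page_url(soup, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    link = soup.select_one('.pagination a.next')
    if link is None or not link.get('href'):
        return None
    return urljoin(current_url, link['href'])

# Made-up pagination markup
page_html = '<div class="pagination"><a class="next" href="?page=3">Next</a></div>'
page_soup = BeautifulSoup(page_html, 'html.parser')

following = next_page_url(page_soup, 'https://example.com/items?page=2')
last = next_page_url(BeautifulSoup('<p>End</p>', 'html.parser'),
                     'https://example.com/items?page=9')
print(following, last)
```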

Advanced Techniques

Handling Forms and Sessions

# Login and maintain session

def login_and_scrape(login_url, username, password, target_url):
    """Login to website and scrape protected content"""
    
    session = requests.Session()
    
    # Get login page
    login_page = session.get(login_url)
    soup = BeautifulSoup(login_page.content, 'lxml')
    
    # Find CSRF token (if present)
    csrf_token = soup.find('input', {'name': 'csrf_token'})
    
    # Prepare login data
    login_data = {
        'username': username,
        'password': password
    }
    
    if csrf_token:
        login_data['csrf_token'] = csrf_token.get('value')
    
    # Submit login form
    response = session.post(login_url, data=login_data)
    
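    # A 200 status only means the request completed. Many sites also return 200
    # for a failed login, so check for a site-specific marker as well.
    if response.status_code == 200:
        print("Login successful")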
        
        # Now access protected content
        protected_page = session.get(target_url)
        soup = BeautifulSoup(protected_page.content, 'lxml')
        
        # Extract protected data
        return soup
    else:
        print("Login failed")
        return None

# Handle cookies
def scrape_with_cookies():
    """Use cookies for authentication"""
    
    cookies = {
        'session_id': 'abc123',
        'auth_token': 'xyz789'
    }
    
    response = requests.get('https://example.com/protected', cookies=cookies)
    soup = BeautifulSoup(response.content, 'lxml')
    return soup
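
Login forms often carry several hidden fields, and the CSRF field is not always named `csrf_token`. A more general approach is to copy every hidden input from the form into the payload. The sketch below uses a made-up form, and `hidden_form_fields` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

def hidden_form_fields(form):
    """Collect name/value pairs from all hidden inputs in a form."""
    return {
        field['name']: field.get('value', '')
        for field in form.find_all('input', attrs={'type': 'hidden'})
        if field.get('name')
    }

login_html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="t0k3n">
  <input type="hidden" name="next" value="/dashboard">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
form = BeautifulSoup(login_html, 'html.parser').find('form')

login_data = hidden_form_fields(form)
login_data.update({'username': 'alice', 'password': 'secret'})  # placeholder credentials
print(login_data)
```

The resulting dictionary can be passed as `data=` to `session.post`, as in `login_and_scrape`.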

Error Handling and Robustness

# Robust scraping with error handling

import logging
from urllib.parse import urljoin
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry  # requests.packages.urllib3 is a legacy alias

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RobustScraper:
    def __init__(self, base_url, max_retries=3):
        self.base_url = base_url
        self.session = self._create_session(max_retries)
        self.scraped_urls = set()
    
    def _create_session(self, max_retries):
        """Create session with retry strategy"""
        session = requests.Session()
        
        retry_strategy = Retry(
            total=max_retries,
            status_forcelist=[429, 500, 502, 503, 504],
            # method_whitelist was renamed in urllib3 1.26 and removed in 2.0
            allowed_methods=["HEAD", "GET", "OPTIONS"],
            backoff_factor=1
        )
        
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        
        # Set headers
        session.headers.update({
            'User-Agent': 'Mozilla/5.0 (compatible; PythonScraper/1.0)'
        })
        
        return session
    
    def scrape_url(self, url):
        """Scrape a single URL with error handling"""
        
        # Check if already scraped
        if url in self.scraped_urls:
            logger.info(f"Already scraped: {url}")
            return None
        
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            
            self.scraped_urls.add(url)
            
            soup = BeautifulSoup(response.content, 'lxml')
            return soup
            
        except requests.exceptions.Timeout:
            logger.error(f"Timeout error for {url}")
        except requests.exceptions.ConnectionError:
            logger.error(f"Connection error for {url}")
        except requests.exceptions.HTTPError as e:
            logger.error(f"HTTP error {e.response.status_code} for {url}")
        except Exception as e:
            logger.error(f"Unexpected error for {url}: {e}")
        
        return None
    
    def extract_links(self, soup, pattern=None):
        """Extract all links from page"""
        links = set()
        
        for link in soup.find_all('a', href=True):
            url = urljoin(self.base_url, link['href'])
            
            # Filter links based on pattern
            if pattern and not re.match(pattern, url):
                continue
                
            links.add(url)
        
        return links
    
    def scrape_site(self, start_url, max_pages=100):
        """Scrape entire site with crawling"""
        
        to_scrape = {start_url}
        scraped_data = []
        
        while to_scrape and len(self.scraped_urls) < max_pages:
            url = to_scrape.pop()
            
            soup = self.scrape_url(url)
            if not soup:
                continue
            
            # Extract data from page
            data = self.extract_page_data(soup, url)
            if data:
                scraped_data.append(data)
            
            # Find more links
            new_links = self.extract_links(soup)
            to_scrape.update(new_links - self.scraped_urls)
            
            # Rate limiting
            time.sleep(1)
        
        return scraped_data
    
    def extract_page_data(self, soup, url):
        """Override this method for specific data extraction"""
        return {
            'url': url,
            'title': soup.find('title').text if soup.find('title') else '',
            'h1': soup.find('h1').text if soup.find('h1') else '',
            'paragraphs': len(soup.find_all('p')),
            'links': len(soup.find_all('a'))
        }
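
`scrape_site` queues every link it finds, including links to other domains and `#fragment` variants of the same page. A crawler usually filters these out first. A sketch using only the standard library; `normalize_internal_link` is a hypothetical helper, not part of `RobustScraper`.

```python
from urllib.parse import urljoin, urlparse, urldefrag

def normalize_internal_link(base_url, href):
    """Return an absolute, fragment-free URL on the same host, or None."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parsed = urlparse(absolute)
    if parsed.scheme not in ('http', 'https'):
        return None  # skip mailto:, javascript:, tel:, ...
    if parsed.netloc != urlparse(base_url).netloc:
        return None  # skip external sites
    return absolute

base = 'https://example.com/blog/'  # placeholder site
candidates = ['post-1#comments', '/about', 'https://other.example/page',
              'mailto:hi@example.com']
internal = [
    url for url in (normalize_internal_link(base, href) for href in candidates)
    if url
]
print(internal)
```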

Best Practices

Ethical Scraping Guidelines

# Always follow ethical scraping practices

def check_robots_txt(domain):
    """Check robots.txt before scraping"""
    from urllib.robotparser import RobotFileParser
    
    rp = RobotFileParser()
    rp.set_url(f"{domain}/robots.txt")
    rp.read()
    
    # Check if URL is allowed
    can_fetch = rp.can_fetch("*", f"{domain}/page")
    crawl_delay = rp.crawl_delay("*")
    
    return can_fetch, crawl_delay

# Respectful scraping class
class EthicalScraper:
    def __init__(self, domain, user_agent="PythonBot/1.0"):
        self.domain = domain
        self.user_agent = user_agent
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': user_agent})
        
        # Check robots.txt
        self.can_scrape, self.delay = check_robots_txt(domain)
        
        if not self.can_scrape:
            raise Exception(f"Scraping not allowed for {domain}")
        
        # Default delay if not specified
        self.delay = self.delay or 1
    
    def scrape(self, url):
        """Scrape with rate limiting"""
        
        if not self.can_scrape:
            return None
        
        response = self.session.get(url, timeout=10)
        
        # Respect rate limits
        time.sleep(self.delay)
        
        return BeautifulSoup(response.content, 'lxml')

# Best practices checklist
scraping_best_practices = """
1. ALWAYS check robots.txt
2. Identify yourself with proper User-Agent
3. Respect rate limits and add delays
4. Cache responses to avoid re-scraping
5. Handle errors gracefully
6. Don't overload servers
7. Check website's terms of service
8. Use APIs when available
9. Be prepared to stop if asked
10. Store data responsibly
"""

print(scraping_best_practices)
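
To see how `RobotFileParser` interprets rules without touching the network, the rules can be fed in directly with `parse()`. The robots.txt content and the `PythonBot/1.0` user agent below are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, one rule per line
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

rp = RobotFileParser()
rp.parse(robots_lines)

private_allowed = rp.can_fetch("PythonBot/1.0", "https://example.com/private/data")
public_allowed = rp.can_fetch("PythonBot/1.0", "https://example.com/products")
delay = rp.crawl_delay("PythonBot/1.0")

print(private_allowed, public_allowed, delay)
```

Checking each URL you are about to request, with your own user agent string, is more precise than testing a single sample path.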

Practice Exercises

Exercise 1: Build a Price Monitor

Create a price monitoring system that:

Scrapes product prices from multiple e-commerce sites
Stores historical price data
Sends alerts when prices drop
Generates price trend visualizations
Handles different site structures

Exercise 2: News Aggregator

Build a news aggregation system:

Scrape articles from multiple news sources
Extract title, author, date, and content
Remove duplicate articles
Categorize articles by topic
Create a searchable database

Exercise 3: Job Market Analyzer

Develop a job market analysis tool:

Scrape job listings from job boards
Extract skills, salary, and requirements
Analyze trending skills
Map the geographic distribution of jobs
Generate market insights report

Key Takeaways

🕷️ BeautifulSoup makes HTML parsing intuitive and pythonic
🔍 Multiple ways to find elements: tags, classes, IDs, CSS selectors
🌳 Navigate HTML trees with parent, children, and sibling methods
⚡ Choose the right parser for speed vs. leniency trade-off
🔄 Handle pagination and dynamic content appropriately
⚠️ Always follow ethical scraping practices
🛡️ Implement robust error handling for production scraping