Web scraping

Description

Web scraping is the process of automatically extracting data from websites. BeautifulSoup is a popular Python library used for parsing HTML and XML documents. It allows you to navigate, search, and modify the parse tree in a readable and Pythonic way.
Advanced BeautifulSoup techniques involve more precise ways of navigating and querying the HTML structure using selectors, tree traversal, and filtering with attributes or CSS classes.
Selenium automates browsers, making it ideal for scraping JavaScript-heavy websites where content loads dynamically.
After scraping, storing data in structured formats like CSV or Excel is essential for analysis and visualization.

Prerequisites

  • Basic understanding of HTML structure (tags, attributes)
  • Familiarity with Python syntax
  • Libraries: requests, beautifulsoup4
  • Basic pandas knowledge
  • File I/O familiarity in Python
  • Installed selenium and a browser driver such as ChromeDriver (see the install command after this list)
  • CSS selector understanding
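
A typical setup, using the package names as published on PyPI (openpyxl is what pandas uses to write .xlsx files):

pip install requests beautifulsoup4 pandas openpyxl selenium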

Examples

Here's a simple web scraping example using Python:


import requests
from bs4 import BeautifulSoup

# Step 1: Send an HTTP GET request to the webpage
url = 'https://example.com'
response = requests.get(url)

# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Step 3: Extract specific data (e.g., all paragraph tags)
paragraphs = soup.find_all('p')

# Step 4: Print the text inside each paragraph tag
for p in paragraphs:
    print(p.text)
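
In practice it helps to check the response status and send a User-Agent header, since many sites reject requests with the default one. A minimal sketch (the header string and the timeout value are illustrative choices, not requirements):

import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
# Illustrative header; many sites block the default python-requests User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # raises an HTTPError for 4xx/5xx responses

soup = BeautifulSoup(response.text, 'html.parser')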

# Advanced BeautifulSoup techniques
from bs4 import BeautifulSoup
import requests

url = 'https://example.com'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')

# Find all <h2> elements with class 'title'
titles = soup.find_all('h2', class_='title')

# CSS selector: <a> tags that are direct children of <div class="card">
links = soup.select('div.card > a')

# Navigate siblings
desc = soup.body.p.find_next_sibling('p')
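
Once elements are matched, attribute values such as href can be read with .get(), and arbitrary attributes can be used as filters. A short sketch building on the selectors above (the div.card > a structure is assumed, as in the example):

# Read the link target and text from each matched anchor
for link in links:
    href = link.get('href')          # returns None if the attribute is missing
    text = link.get_text(strip=True)
    print(text, '->', href)

# Filter by attribute presence: any <img> that has an alt attribute
images_with_alt = soup.find_all('img', attrs={'alt': True})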

# Using Selenium for JavaScript-heavy pages
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get("https://example.com")
time.sleep(3)  # crude fixed wait for dynamic content; see the explicit-wait sketch below

title = driver.find_element(By.TAG_NAME, 'h1').text
print("Page title:", title)

driver.quit()
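
time.sleep() works but wastes time and can still be too short. Selenium's explicit waits poll until a condition holds. A sketch using WebDriverWait (the 10-second timeout and the h1 locator are illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")

# Wait up to 10 seconds for the <h1> element to be present in the DOM
heading = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)
print("Page title:", heading.text)

driver.quit()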

# Storing scraped data with pandas
import pandas as pd

# Example data
data = {'Name': ['Product A', 'Product B'], 'Price': [120, 150]}
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('products.csv', index=False)

# Save to Excel
df.to_excel('products.xlsx', index=False)
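
To confirm the files were written correctly, the data can be loaded back with pandas (reading and writing .xlsx requires the openpyxl package):

# Load the saved files back into DataFrames to verify the round trip
df_csv = pd.read_csv('products.csv')
df_xlsx = pd.read_excel('products.xlsx')
print(df_csv.equals(df_xlsx))  # True when both files hold the same data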



📝 Comments:
requests.get() is used to fetch the page content.
BeautifulSoup(response.text, 'html.parser') creates a parsed HTML tree.
find_all('p') finds all paragraph elements.
select() supports powerful CSS-style selectors.
find_next_sibling() helps navigate between elements at the same level.
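
These pieces combine naturally: collect each record as a dict, then build a DataFrame for storage. A sketch assuming a hypothetical page with repeated <div class="card"> blocks, each containing an <h2 class="title"> and a <span class="price">:

import requests
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://example.com').text, 'html.parser')

# Build one dict per card; the selectors mirror the assumed page structure
rows = []
for card in soup.select('div.card'):
    rows.append({
        'title': card.select_one('h2.title').get_text(strip=True),
        'price': card.select_one('span.price').get_text(strip=True),
    })

df = pd.DataFrame(rows)
df.to_csv('cards.csv', index=False)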

Real-World Applications

Finance

  • Scrape stock prices from financial sites

Healthcare

  • Collect public health data from medical news portals

E-commerce

  • Monitor product prices and reviews

Where Web Scraping Is Applied

Finance

  • Scraping market news and tickers

E-commerce

  • Tracking competitor product listings and prices

Marketing

  • Analyzing comments and feedback from product review sites

Resources


Harvard Data Science Course

Free online course from Harvard covering data science foundations


Interview Questions

➤ What is BeautifulSoup?
It's a Python library for parsing HTML/XML to extract data from web pages.

➤ Which parsers can BeautifulSoup use?
'html.parser' (built-in), or 'lxml' for faster parsing.

➤ What is the difference between find() and find_all()?
find() returns the first matching tag; find_all() returns a list of all matching tags.

➤ How do you extract the text from a tag?
Use .text or .get_text() on a tag object.

➤ Which libraries are commonly used alongside BeautifulSoup?
requests for HTTP, lxml for parsing, pandas for storing scraped data.