Web scraping
Description
Web scraping is the process of automatically extracting data from websites. BeautifulSoup is a popular Python library used for parsing HTML and XML documents. It allows you to navigate, search, and modify the parse tree in a readable and Pythonic way.
Advanced BeautifulSoup techniques give you more precise control over navigating and querying the HTML tree: CSS selectors, sibling and parent traversal, and filtering by attributes or classes.
Selenium automates browsers, making it ideal for scraping JavaScript-heavy websites where content loads dynamically.
After scraping, storing data in structured formats like CSV or Excel is essential for analysis and visualization.
Prerequisites
- Basic understanding of HTML structure (tags, attributes)
- Familiarity with Python syntax
- Libraries: requests, beautifulsoup4
- Basic pandas knowledge
- File I/O familiarity in Python
- Install: selenium and a browser driver (such as ChromeDriver); install commands are shown after this list
- CSS selector understanding
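The third-party packages can be installed with pip (openpyxl is the engine pandas uses to write .xlsx files):
pip install requests beautifulsoup4 pandas selenium openpyxl
Recent Selenium releases (4.6+) include Selenium Manager, which downloads a matching browser driver automatically; otherwise install ChromeDriver manually.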
Examples
Here's a simple example of a web scraping task using Python:
import requests
from bs4 import BeautifulSoup
# Step 1: Send an HTTP GET request to the webpage
url = 'https://example.com'
response = requests.get(url)
# Step 2: Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Extract specific data (e.g., all paragraph tags)
paragraphs = soup.find_all('p')
# Step 4: Print the text inside each paragraph tag
for p in paragraphs:
    print(p.text)
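Requests can fail or return an error page, so it's safer to check the response before parsing. A minimal sketch of the same request with basic error handling (the URL is the same placeholder as above):
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url, timeout=10)  # stop waiting if the server is unresponsive
response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
soup = BeautifulSoup(response.text, 'html.parser')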
# Advanced BeautifulSoup techniques
from bs4 import BeautifulSoup
import requests
url = 'https://example.com'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
# Find all <h2> elements with class 'title'
titles = soup.find_all('h2', class_='title')
# CSS selectors: <a> tags that are direct children of <div class="card">
links = soup.select('div.card > a')
# Navigate siblings: the <p> that follows the first <p> in <body>
desc = soup.body.p.find_next_sibling('p')
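Tags returned by find_all() or select() expose their HTML attributes like a dictionary, which is how you pull URLs out of links. A small sketch continuing from the links result above (the div.card > a structure is an assumption carried over from that selector):
for link in links:
    # .get() returns None instead of raising KeyError when 'href' is missing
    print(link.get_text(strip=True), link.get('href'))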
# Using Selenium for JavaScript-rendered pages
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
driver = webdriver.Chrome()  # requires a Chrome driver (recent Selenium versions can fetch one automatically)
driver.get("https://example.com")
time.sleep(3)  # crude fixed wait for JavaScript content to load
title = driver.find_element(By.TAG_NAME, 'h1').text
print("Page title:", title)
driver.quit()  # close the browser when finished
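A fixed sleep is fragile: it wastes time on fast pages and can be too short on slow ones. Below is a minimal sketch of the same flow using Selenium's explicit waits; the tag name and URL are carried over from the example above.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")
# Wait up to 10 seconds for an <h1> to appear, then continue immediately
h1 = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'h1'))
)
print("Page title:", h1.text)
driver.quit()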
# Storing scraped data
import pandas as pd
# Example data
data = {'Name': ['Product A', 'Product B'], 'Price': [120, 150]}
df = pd.DataFrame(data)
# Save to CSV
df.to_csv('products.csv', index=False)
# Save to Excel (requires the openpyxl package)
df.to_excel('products.xlsx', index=False)
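In a real pipeline the DataFrame is built from scraped tags rather than typed in by hand. A hedged sketch, assuming the page marks product names with <h2 class="title"> as in the advanced example (the URL and class name are placeholders):
import pandas as pd
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')
# One row per product heading found on the page
rows = [{'Name': h2.get_text(strip=True)} for h2 in soup.find_all('h2', class_='title')]
pd.DataFrame(rows).to_csv('products.csv', index=False)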
📝 Comments:
requests.get() is used to fetch the page content.
BeautifulSoup(response.text, 'html.parser') creates a parsed HTML tree.
find_all('p') finds all paragraph elements.
select() supports powerful CSS-style selectors.
find_next_sibling() helps navigate between elements at the same level.
Real-World Applications
Finance
Scrape stock prices from financial sites
Healthcare
Collect public health data from medical news portals
E-commerce
Monitor product prices and reviews
Where Web Scraping Is Applied
Finance
- Scraping market news and tickers
E-commerce
- Tracking competitor product listings and prices
Marketing
- Analyzing comments and feedback from product review sites
Resources
Web scraping topic PDF
Harvard Data Science Course
Free online course from Harvard covering data science foundations
Interview Questions
Q: What is BeautifulSoup?
➤ It's a Python library for parsing HTML/XML to extract data from web pages.
Q: Which parsers can BeautifulSoup use?
➤ 'html.parser' (built-in), or 'lxml' for faster parsing.
Q: What is the difference between find() and find_all()?
➤ find() returns the first matching tag; find_all() returns a list of all matching tags.
Q: How do you extract the text from a tag?
➤ Use .text or .get_text() on a tag object.
Q: Which libraries are commonly used alongside BeautifulSoup?
➤ requests for HTTP, lxml for parsing, pandas for storing scraped data.