Lab Title: Web Scraping and BFS Traversal of a Website
Objective: The purpose of this lab is to learn how to perform web scraping and extract links from a website using Python. We will employ the Breadth-First Search (BFS) technique to traverse a website and collect internal links.
Prerequisites:
- Basic understanding of Python.
- Knowledge of web scraping concepts.
- Familiarity with data structures such as queues.
- Installed Python libraries: requests and beautifulsoup4.
To install the required libraries, run the following command:
pip install requests beautifulsoup4
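After installation, the script needs the following imports at the top of the file: requests and BeautifulSoup for fetching and parsing pages, deque for the BFS queue, and time for the polite delay between requests.

import time
from collections import deque

import requests
from bs4 import BeautifulSoup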
Topics Covered:
- Web scraping using requests and BeautifulSoup.
- Validating and extracting internal links.
- Implementing BFS traversal to visit website pages systematically.
- Writing extracted links to a file.
- Handling exceptions and adding delays to avoid overwhelming the server.
1. Checking Validity of Links
def is_valid_link(link):
    """Check if the link is valid and should be followed (e.g., avoid external links)."""
    return link and link.startswith("https://priceoye.pk")
This function ensures that only internal links belonging to priceoye.pk are processed.
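A quick sanity check of this filter (the URLs below are illustrative):

print(is_valid_link("https://priceoye.pk/mobiles"))   # True  - internal link, will be followed
print(is_valid_link("https://www.example.com/page"))  # False - external link, skipped
print(is_valid_link(None))                            # None (falsy) - a missing link is treated as invalid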
2. Extracting Links from a Webpage
def get_links_from_page(url):
    """Extract all valid internal links from the given webpage."""
    try:
        # Time out after 10 seconds so one slow page cannot stall the whole crawl
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return set()
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            # Convert relative links (e.g., "/mobiles") to absolute URLs
            if href.startswith('/'):
                href = "https://priceoye.pk" + href
            if is_valid_link(href):
                links.add(href)
        return links
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return set()
This function fetches a webpage, parses it using BeautifulSoup, and extracts all valid internal links.
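A minimal usage sketch; the actual output depends on the live site and on network access:

homepage_links = get_links_from_page("https://priceoye.pk")
print(f"Found {len(homepage_links)} internal links on the start page")
for link in sorted(homepage_links)[:5]:  # preview a few of the discovered URLs
    print(link)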
3. BFS Traversal of the Website
def bfs_traverse_website(start_url, max_depth=3):
    """Perform a breadth-first search (BFS) on the website starting from start_url."""
    queue = deque([(start_url, 0)])
    visited = set()
    with open("priceoyelinks.txt", "w") as f:
        while queue:
            current_url, depth = queue.popleft()
            if current_url in visited or depth > max_depth:
                continue
            visited.add(current_url)
            print(f"Visiting: {current_url} (Depth: {depth})")
            f.write(f"{current_url}\n")
            links = get_links_from_page(current_url)
            for link in links:
                if link not in visited:
                    queue.append((link, depth + 1))
            time.sleep(1)  # Pause to avoid server overload
The BFS algorithm uses a queue to systematically visit pages and extract links, ensuring all pages are explored up to the specified depth.
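The breadth-first order comes from the FIFO behaviour of collections.deque: pages discovered earlier are dequeued before pages discovered later. A minimal illustration (the second URL is hypothetical):

from collections import deque

queue = deque([("https://priceoye.pk", 0)])        # start page at depth 0
queue.append(("https://priceoye.pk/mobiles", 1))   # a link discovered on the start page (depth 1)
print(queue.popleft())   # ('https://priceoye.pk', 0) - the oldest entry is processed first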
4. Running the Script
if __name__ == "__main__":
    start_url = "https://priceoye.pk"  # Replace with your starting URL
    bfs_traverse_website(start_url, max_depth=3)
The script starts the BFS traversal from the given URL and extracts links up to a depth of 3.
Key Points:
- BFS traversal ensures that all reachable internal links are extracted systematically.
- Adding a delay (time.sleep(1)) between requests prevents overwhelming the server.
- Writing the extracted links to a file (priceoyelinks.txt) helps in further analysis (see the short sketch after this list).
- Error handling ensures that the script keeps running even if some pages fail to load.
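As a sketch of that further analysis, the saved file can be read back once a crawl finishes (this assumes priceoyelinks.txt exists in the working directory):

with open("priceoyelinks.txt") as f:
    collected = [line.strip() for line in f if line.strip()]
print(f"Collected {len(collected)} pages during the crawl")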
Applications:
- Web scraping for price comparison websites.
- Crawling and indexing webpages for search engines.
- Extracting data for research purposes.
- Identifying new product pages or updates on e-commerce sites.
Limitations:
- Some websites disallow scraping through robots.txt rules or block automated requests with CAPTCHAs.
- The script only extracts links; it does not fetch additional data such as prices or descriptions.
- The BFS traversal depth must be chosen carefully to balance coverage against crawl time.
Extensions:
- Modify the script to extract additional data such as product names and prices.
- Implement multi-threading to speed up the crawling process.
- Store the extracted data in a structured format (e.g., CSV, JSON, or a database).
- Respect robots.txt and avoid scraping restricted pages (see the sketch after this list).
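As a starting point for the last extension, Python's standard urllib.robotparser module can check whether a page may be fetched before requesting it. This is an illustrative sketch, not part of the lab script, and it assumes the site publishes a robots.txt file; the candidate URL is hypothetical.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://priceoye.pk/robots.txt")
rp.read()   # download and parse the site's robots.txt rules

candidate = "https://priceoye.pk/mobiles"   # hypothetical page to check
if rp.can_fetch("*", candidate):            # "*" = rules that apply to any user agent
    print("Allowed to crawl:", candidate)
else:
    print("Disallowed by robots.txt:", candidate)

A real crawler would run a check like this before enqueueing a link in bfs_traverse_website.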
Conclusion: This lab introduced BFS-based web scraping, demonstrating how to systematically traverse a website and extract internal links. The concepts and code presented here can be extended to build more sophisticated web crawlers for various applications.
Review Questions:
- What are the advantages of BFS over DFS for web crawling?
- How can you modify the script to extract product prices along with the links?
- What are the ethical considerations when performing web scraping?
- How would you implement multi-threading to improve the performance of this scraper?
- What challenges might arise when crawling a large e-commerce website?