Lab Title: Web Scraping and BFS Traversal of a Website
Objective: The purpose of this lab is to learn how to perform web scraping and extract links from a website using Python. We will employ the Breadth-First Search (BFS) technique to traverse a website and collect internal links.
Prerequisites:
- Basic understanding of Python.
- Knowledge of web scraping concepts.
- Familiarity with data structures such as queues.
- Installed Python libraries: requests and beautifulsoup4.
To install the required libraries, run the following command:
pip install requests beautifulsoup4
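After installation, the script needs the following imports at the top of the file: requests and BeautifulSoup for fetching and parsing pages, deque for the BFS queue, and time for the polite delay between requests.

import time
from collections import deque

import requests
from bs4 import BeautifulSoup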
Topics Covered:
- Web scraping using requests and BeautifulSoup.
- Validating and extracting internal links.
- Implementing BFS traversal to visit website pages systematically.
- Writing extracted links to a file.
- Handling exceptions and adding delays to avoid overwhelming the server.
1. Checking Validity of Links
def is_valid_link(link):
    """Check if the link is valid and should be followed (e.g., avoid external links)."""
    return link and link.startswith("https://priceoye.pk")
This function ensures that only internal links belonging to priceoye.pk are processed.
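A quick sanity check of this filter (the URLs below are illustrative):

print(is_valid_link("https://priceoye.pk/mobiles"))   # True  - internal link, will be followed
print(is_valid_link("https://www.example.com/page"))  # False - external link, skipped
print(is_valid_link(None))                            # None (falsy) - a missing link is treated as invalid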
2. Extracting Links from a Webpage
def get_links_from_page(url):
    """Extract all valid internal links from the given webpage."""
    try:
        # Time out after 10 seconds so one slow page cannot stall the whole crawl
        response = requests.get(url, timeout=10)
        if response.status_code != 200:
            return set()
        soup = BeautifulSoup(response.text, 'html.parser')
        links = set()
        for a_tag in soup.find_all('a', href=True):
            href = a_tag['href']
            # Convert relative links (e.g., "/mobiles") to absolute URLs
            if href.startswith('/'):
                href = "https://priceoye.pk" + href
            if is_valid_link(href):
                links.add(href)
        return links
    except Exception as e:
        print(f"Error fetching {url}: {e}")
        return set()
This function fetches a webpage, parses it using BeautifulSoup, and extracts all valid internal links.
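A minimal usage sketch; the actual output depends on the live site and on network access:

homepage_links = get_links_from_page("https://priceoye.pk")
print(f"Found {len(homepage_links)} internal links on the start page")
for link in sorted(homepage_links)[:5]:  # preview a few of the discovered URLs
    print(link)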
3. BFS Traversal of the Website
def bfs_traverse_website(start_url, max_depth=3):
    """Perform a breadth-first search (BFS) on the website starting from start_url."""
    queue = deque([(start_url, 0)])
    visited = set()
    with open("priceoyelinks.txt", "w") as f:
        while queue:
            current_url, depth = queue.popleft()
            if current_url in visited or depth > max_depth:
                continue
            visited.add(current_url)
            print(f"Visiting: {current_url} (Depth: {depth})")
            f.write(f"{current_url}\n")
            links = get_links_from_page(current_url)
            for link in links:
                if link not in visited:
                    queue.append((link, depth + 1))
            time.sleep(1)  # Pause to avoid server overload
The BFS algorithm uses a queue to systematically visit pages and extract links, ensuring all pages are explored up to the specified depth.
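The breadth-first order comes from the FIFO behaviour of collections.deque: pages discovered earlier are dequeued before pages discovered later. A minimal illustration (the second URL is hypothetical):

from collections import deque

queue = deque([("https://priceoye.pk", 0)])        # start page at depth 0
queue.append(("https://priceoye.pk/mobiles", 1))   # a link discovered on the start page (depth 1)
print(queue.popleft())   # ('https://priceoye.pk', 0) - the oldest entry is processed first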
4. Running the Script
if __name__ == "__main__":
    start_url = "https://priceoye.pk"  # Replace with your starting URL
    bfs_traverse_website(start_url, max_depth=3)
The script starts the BFS traversal from the given URL and extracts links up to a depth of 3.
Key Points:
- BFS traversal ensures that all reachable internal links are extracted systematically.
- Adding a delay (time.sleep(1)) between requests prevents overwhelming the server.
- Writing the extracted links to a file (priceoyelinks.txt) helps in further analysis (see the short sketch after this list).
- Error handling ensures that the script keeps running even if some pages fail to load.
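As a sketch of that further analysis, the saved file can be read back once a crawl finishes (this assumes priceoyelinks.txt exists in the working directory):

with open("priceoyelinks.txt") as f:
    collected = [line.strip() for line in f if line.strip()]
print(f"Collected {len(collected)} pages during the crawl")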
Applications:
- Web scraping for price comparison websites.
- Crawling and indexing webpages for search engines.
- Extracting data for research purposes.
- Identifying new product pages or updates on e-commerce sites.
Limitations:
- Some websites disallow scraping through robots.txt rules or block automated requests with CAPTCHAs.
- The script only extracts links; it does not fetch additional data such as prices or descriptions.
- The BFS traversal depth must be chosen carefully to balance coverage against crawl time.
Extensions:
- Modify the script to extract additional data such as product names and prices.
- Implement multi-threading to speed up the crawling process.
- Store the extracted data in a structured format (e.g., CSV, JSON, or a database).
- Respect robots.txt and avoid scraping restricted pages (see the sketch after this list).
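As a starting point for the last extension, Python's standard urllib.robotparser module can check whether a page may be fetched before requesting it. This is an illustrative sketch, not part of the lab script, and it assumes the site publishes a robots.txt file; the candidate URL is hypothetical.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://priceoye.pk/robots.txt")
rp.read()   # download and parse the site's robots.txt rules

candidate = "https://priceoye.pk/mobiles"   # hypothetical page to check
if rp.can_fetch("*", candidate):            # "*" = rules that apply to any user agent
    print("Allowed to crawl:", candidate)
else:
    print("Disallowed by robots.txt:", candidate)

A real crawler would run a check like this before enqueueing a link in bfs_traverse_website.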
Conclusion: This lab introduced BFS-based web scraping, demonstrating how to systematically traverse a website and extract internal links. The concepts and code presented here can be extended to build more sophisticated web crawlers for various applications.
Review Questions:
- What are the advantages of BFS over DFS for web crawling?
- How can you modify the script to extract product prices along with the links?
- What are the ethical considerations when performing web scraping?
- How would you implement multi-threading to improve the performance of this scraper?
- What challenges might arise when crawling a large e-commerce website?