Objective: To develop a Python program that reads an Urdu text file line by line, extracts words, stores them uniquely, and finally sorts and prints them in alphabetical order.
Requirements:
Python 3.x
Urdu text file ("2563.txt")
Basic understanding of string manipulation and file handling in Python
Theory: Natural Language Processing (NLP) involves analyzing and processing textual data. A fundamental first step in most NLP pipelines is tokenization, which later analyses such as word-frequency counting build on. This experiment exercises basic data collection: extracting the unique words from an Urdu e-book and sorting them alphabetically.
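The tokenization step mentioned above can be illustrated on a single line of text. The Urdu phrase here is illustrative, not taken from "2563.txt":

```python
# Whitespace tokenization of one line of Urdu text (sample phrase,
# meaning "this is an example").
line = "یہ ایک مثال ہے\n"
words = line.strip().split()  # strip the newline, then split on whitespace
print(words)  # a list of four word tokens
```

Note that str.split() with no arguments splits on any run of whitespace and discards empty strings, so multiple spaces between words do not produce empty tokens.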
Procedure:
Open the file "2563.txt" in read mode using Python.
Read the file line by line.
Tokenize each line into words by splitting it on whitespace.
Maintain a list to store unique words.
If a word is not already in the list, add it.
After processing all lines, sort the list alphabetically.
Print the sorted list of words.
Code Implementation:
# Open the file in read mode with UTF-8 encoding
with open("2563.txt", "r", encoding="utf-8") as file:
    words_list = []  # List to store unique words
    for line in file:
        words = line.strip().split()  # Tokenizing the line into words
        for word in words:
            if word not in words_list:
                words_list.append(word)

# Sorting the words alphabetically
words_list.sort()

# Printing the sorted unique words
for word in words_list:
    print(word)
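The list-based membership test above costs O(n) per word, so the whole pass is quadratic in the number of unique words. A set-based variant (an alternative sketch, not part of the original experiment) keeps membership tests at O(1) and sorts once at the end; here an in-memory buffer stands in for open("2563.txt", "r", encoding="utf-8"):

```python
import io

# In-memory stand-in for the Urdu text file, so the sketch is
# self-contained; replace with open("2563.txt", ...) in practice.
sample_file = io.StringIO("سلام دنیا\nدنیا خوبصورت ہے\n")

unique_words = set()  # set membership is O(1) on average
for line in sample_file:
    unique_words.update(line.strip().split())

# Sort once at the end and print one word per line.
for word in sorted(unique_words):
    print(word)
```

The output is the same as the list version's; only the bookkeeping changes, which matters once the e-book contains tens of thousands of distinct words.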
Expected Output:
The program prints every unique word from the Urdu text file, one per line, in sorted order. (Python's default sort orders strings by Unicode code point, which approximates but does not exactly match Urdu dictionary order.)
Observations:
The presence of punctuation may affect tokenization.
Urdu words with similar spellings but different diacritics may be treated as distinct words.
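Both observations can be addressed with a normalization pass before the uniqueness check. The sketch below is an extension, not part of the original program: it strips common Urdu punctuation marks and removes combining diacritics (zabar, zer, pesh) so that variants of the same word collapse to one form. The punctuation set chosen here is an assumption and may need extending for a given text:

```python
import unicodedata

# Common Urdu punctuation: full stop ۔, comma ،, question mark ؟,
# plus a few Latin marks that also appear in Urdu e-books.
URDU_PUNCT = set("۔،؟؛!:\"'()")

def normalize(word):
    # Drop punctuation characters.
    word = "".join(ch for ch in word if ch not in URDU_PUNCT)
    # Drop combining marks (Unicode category "Mn"), i.e. diacritics,
    # so forms differing only in harakat compare equal.
    word = "".join(ch for ch in word if unicodedata.category(ch) != "Mn")
    return word

print(normalize("کتاب۔"))   # trailing Urdu full stop removed
print(normalize("کَتاب"))   # zabar (U+064E) removed
```

Applying normalize() to each token before the "if word not in words_list" check would merge such variants; whether diacritics should be discarded depends on the downstream task, since they occasionally distinguish words.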
Conclusion: This experiment demonstrates a simple method for collecting and processing textual data in NLP. The program extracts and organizes the unique words of a dataset, a fundamental step in text preprocessing.