
how to calculate the number of unique words just in a part of a file

P: 3
I have a file in Persian (a Persian sentence, a "tab", then a Persian word, again a "tab" and then an English word). I have to calculate the number of unique words just in Persian sentences and not the Persian and English words after the tabs. Here's the code:


from hazm import *

file = "F.txt"
def WordsProbs (file):
words = set()
with open (file, encoding = "utf-8") as f1:
normalizer = Normalizer()
for line in f1:
tmp = line.strip().split("\t")
words = set(normalizer.normalize(tmp[0].split()))
print(len(words), "unique words")
print(words)



To access just the sentences I have to split each line by "\t", and to access each word of the sentence I have to split tmp[0]. The problem is that when I run the code, the error below occurs. It's because of the split after tmp[0]. But if I omit this split after tmp[0], it just counts the letters, not the unique words. How can I fix it? (Is there another way to write this code to calculate unique words?)

The error: Traceback (most recent call last): File "C:\Users\yasini\Desktop\16.py", line 15, in WordsProbs (file) File "C:\Users\yasini\Desktop\16.py", line 10, in WordsProbs words.update(set(normalizer.normalize(tmp[0].split()))) File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 46, in normalize text = self.character_refinement(text) File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 65, in character_refinement text = text.translate(self.translations) AttributeError: 'list' object has no attribute 'translate'

sample file: https://www.dropbox.com/s/r88hglemg7aot0w/F.txt?dl=0
Oct 29 '16 #1
1 Reply


Expert 100+
P: 621
I cannot make much sense of your post. The code should be in code tags to preserve indentation, which is important in Python (click on the "CODE/" icon and place the code in between). Also, the error message is difficult to read because it is posted here all on one line. Finally, we do not know where the "Normalizer" package is from, and since it has not been explicitly imported, we have to assume that this is an error that has to be fixed first.

The error says
AttributeError: 'list' object has no attribute 'translate'
which means you are sending a list to Normalizer and it does not like that, but we have no way of knowing whether you have written a "Normalizer" program or if it is from somewhere else, and so don't know what it expects to receive.
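To make that concrete: the last frame of the traceback calls text.translate(...), and translate is a method of str, not of list. The following is only a minimal sketch of that failure. It uses plain str.translate with a made-up translation table as a stand-in for whatever the normalizer does internally; the names refine and translations are hypothetical and not taken from the package.

```python
# Hypothetical translation table (an assumption, not the package's own):
# maps Arabic kaf (U+0643) to Persian keheh (U+06A9).
translations = str.maketrans({"\u0643": "\u06a9"})

def refine(text):
    # Same call shape as the failing line in the traceback.
    return text.translate(translations)

sentence = "\u0643\u062a\u0627\u0628 \u0645\u0646"  # two words

# Passing the string works; split afterwards to get the words.
words = refine(sentence).split()

# Passing the already-split list reproduces the AttributeError.
try:
    refine(sentence.split())
    error = None
except AttributeError as exc:
    error = str(exc)

print(words)
print(error)
```

So the order of operations matters: hand the function a string first, and split the result into words afterwards.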

when I run the code the error below occurs. It's because of the split after tmp[0]
First, get the "split into words" portion working correctly and then add the "Normalizer". Do this split on a line by itself and print the result to see what it contains. There is not enough info in this post for anyone to test it, i.e. the code is not formatted properly and it is unclear what "Normalizer" is.
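Following that advice, here is one possible way to structure the counting, offered as a sketch rather than a tested fix for the original file. It assumes the layout described in the question (sentence, tab, Persian word, tab, English word). The function name count_unique_words and the normalize parameter are made up for illustration, and the default normalize does nothing, so the splitting can be checked on its own before any normalizer is plugged in. The sample text is invented.

```python
import os
import tempfile

def count_unique_words(path, normalize=lambda text: text):
    """Return the set of unique words found in the first tab-separated column."""
    words = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            # Keep only the sentence, i.e. everything before the first tab.
            sentence = line.rstrip("\n").split("\t")[0]
            # Normalize the whole sentence (a str) first, then split into words.
            tokens = normalize(sentence).split()
            # Accumulate across lines instead of reassigning the set each time.
            words.update(tokens)
    return words

# Tiny invented sample in the described layout.
sample = "من کتاب دارم\tکتاب\tbook\nمن قلم دارم\tقلم\tpen\n"
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt",
                                 delete=False) as tmp:
    tmp.write(sample)

unique = count_unique_words(tmp.name)
os.remove(tmp.name)
print(len(unique), "unique words")
```

If the Normalizer in the question is the one from hazm (the traceback points to hazm\Normalizer.py), then passing Normalizer().normalize as the second argument should fit here, since the traceback suggests normalize expects a string. Note also that the posted code assigns words = set(...) inside the loop, which would keep only the last line's words; update() avoids that.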
Oct 29 '16 #2
