I have a file in Persian: each line contains a Persian sentence, a tab, a Persian word, another tab, and an English word. I have to calculate the number of unique words in the Persian sentences only, not the Persian and English words after the tabs. Here's the code:
from hazm import *

file = "F.txt"

def WordsProbs(file):
    words = set()
    with open(file, encoding="utf-8") as f1:
        normalizer = Normalizer()
        for line in f1:
            tmp = line.strip().split("\t")
            words.update(set(normalizer.normalize(tmp[0].split())))
    print(len(words), "unique words")
    print(words)

WordsProbs(file)
To access just the sentences, I split each line by "\t", and to access each word of the sentence I split tmp[0]. The problem is that when I run the code, the error below occurs; it's caused by the split after tmp[0]. But if I omit that split, it just counts the letters, not the unique words. How can I fix it? (Is there another way to write this code to calculate unique words?)
The error:

Traceback (most recent call last):
  File "C:\Users\yasini\Desktop\16.py", line 15, in <module>
    WordsProbs(file)
  File "C:\Users\yasini\Desktop\16.py", line 10, in WordsProbs
    words.update(set(normalizer.normalize(tmp[0].split())))
  File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 46, in normalize
    text = self.character_refinement(text)
  File "C:\Python34\lib\site-packages\hazm\Normalizer.py", line 65, in character_refinement
    text = text.translate(self.translations)
AttributeError: 'list' object has no attribute 'translate'
sample file: https://www.dropbox.com/s/r88hglemg7aot0w/F.txt?dl=0
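From the traceback, hazm's Normalizer.normalize() ends up calling text.translate(), so it seems to expect a string rather than the list that tmp[0].split() returns. Would something like the following sketch be the right fix? It normalizes the sentence string first and only then splits it into words (the name words_probs is just illustrative):

from hazm import Normalizer

def words_probs(path):
    words = set()
    normalizer = Normalizer()
    with open(path, encoding="utf-8") as f1:
        for line in f1:
            tmp = line.strip().split("\t")
            if not tmp[0]:
                continue  # skip empty lines
            # normalize() receives the sentence as a string; the normalized
            # sentence is then split into individual words
            sentence = normalizer.normalize(tmp[0])
            words.update(sentence.split())
    print(len(words), "unique words")
    print(words)

words_probs("F.txt")

This way the set accumulates words from every line instead of being rebuilt on each iteration, and len(words) at the end gives the unique word count for the whole file.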