Bytes IT Community

Stop-word removal and document tokenization using NLTK library

I’m having difficulty removing stop words from, and tokenizing, a .txt file using NLTK. I keep getting the following error: AttributeError: 'list' object has no attribute 'lower'. I can’t figure out what I’m doing wrong; this is my first time doing something like this. My code is below. I’d appreciate any suggestions, thanks.

import nltk
from nltk.corpus import stopwords

s = open("C:\zircon\sinbo1.txt").read()
tokens = nltk.word_tokenize(s)

def cleanupDoc(s):
    stopset = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(s)
    cleanup = [token.lower() for token in tokens.lower() not in stopset and len(token)>2]
    return cleanup

cleanupDoc(s)
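The AttributeError comes from `tokens.lower()`: `lower()` is a string method, and `tokens` is a *list* of strings, so it must be applied to each token individually. The comprehension is also missing the `if` keyword before its filter condition. A minimal corrected sketch is below; the hard-coded stop set and whitespace split are stand-ins so it runs without NLTK's data downloads — with NLTK set up, you would use `set(stopwords.words('english'))` and `nltk.word_tokenize(s)` instead.

```python
def cleanupDoc(tokens, stopset):
    # Lowercase each token individually, then keep only those that are
    # not stop words and are longer than two characters.
    return [t.lower() for t in tokens
            if t.lower() not in stopset and len(t) > 2]

# Stand-ins for stopwords.words('english') and nltk.word_tokenize(s):
stopset = {"the", "and", "was"}
tokens = "The ship and the crew was lost at sea".split()

print(cleanupDoc(tokens, stopset))  # ['ship', 'crew', 'lost', 'sea']
```

Note that the filter tests `t.lower()`, not `t`, against the stop set; otherwise a capitalized stop word like "The" would slip through.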
Jun 30 '13 #1