Bytes IT Community

How to split a text to words and then filter it in Python?

Hi guys,
I have a question and would love some ideas on how to get started.

First of all, I'm using Windows 7 and Python 2.7.

I've got a text file, and I'm trying to read the text from it and then check every word against the 40 words around it, to make sure the word in question has not been repeated more than once.

In other words, I want to first split the text into words, put them in a list and then check [0] against [1] all the way to [39]. Then I want to check [1] against [40], then check [2] against [41] etc.

Splitting the words is not that hard, I think; I just need to split at every space and every dot. What I am not sure about is how to check each word against the other words in the text.
Any ideas on how that could be done? =)
Nov 1 '10 #1
3 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
Yes, I have an idea on how it could be done. Split the text into a list of words, convert to lower case and strip any punctuation. Iterate on the list and create a sublist by slicing the list as in words[lowIdx:highIdx]. Adjust the low and high indices as required when near the start and end of the word list. Pop the current word from the sublist. Iterate on the remaining members of the sublist to compare to the current word.
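
A minimal sketch of the steps described above, not a definitive implementation. The function names `normalize` and `repeated_nearby` and the `radius` parameter are hypothetical, and the sketch assumes that "around" means up to 40 words on either side of the current word:

```python
import string


def normalize(text):
    """Split on whitespace, lower-case each word and strip surrounding punctuation."""
    words = []
    for raw in text.split():
        word = raw.lower().strip(string.punctuation)
        if word:  # skip tokens that were pure punctuation
            words.append(word)
    return words


def repeated_nearby(words, radius=40):
    """Return the indices of words that occur again within `radius` positions."""
    repeated = []
    for idx, word in enumerate(words):
        # Clamp the slice near the start and end of the list.
        low_idx = max(0, idx - radius)
        high_idx = min(len(words), idx + radius + 1)
        neighbours = words[low_idx:high_idx]
        # Pop the current word so it is not compared with itself.
        neighbours.pop(idx - low_idx)
        if word in neighbours:
            repeated.append(idx)
    return repeated


words = normalize("The cat sat. The dog ran; the bird flew.")
print(repeated_nearby(words, radius=40))  # prints [0, 3, 6]
```

Slicing creates a new list for every word, which is simple to follow and should be fast enough for an ordinary text file.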

The best way to learn how to program in Python is to write programs. Try writing the code and post back with your questions.
Nov 1 '10 #2

Hi again,
I'll explain further what I want to do. I want to read in a text file and check whether a word appears more than once in the last 40 words. In other words, I want to filter out such words by marking them like this: *RandomWordThatAppearedMoreThanOnceInTheLast40Words*.

Here is the code I've been working on so far. Currently I'm ignoring all the dots, semicolons etc. I just want to get the basics done.


    infil = open('story.txt')

    line = infil.readlines()

    wordlist = list()

    allTheWords = line.split()

    if string in dictionary:
        dictionary(string) += 1
        else:
            dictionary(string) = 1

    if len(wordlist) > 40:
        del wordlist[0]

    finishedText = (' ').join(allTheWords)
Nov 2 '10 #3

Expert 100+
P: 624
You should print "line", "allTheWords", and "wordlist" to see if they contain what you think they do. Also, the indentation for the if and else is incorrect. You can get the last 40 words with
wordlist[-40:]
See Section 14.5 here for an example of reading a file, and then substitute the name of the file you wish to read.
Some info on lists http://www.greenteapress.com/thinkpy...l/book011.html
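
A minimal sketch of how the script from the question might be completed. It assumes that "the last 40 words" means the 40 words immediately before the current one, and it replaces the dictionary counter with a plain membership test on the list of recent words. The name `mark_repeats` and the `window` parameter are hypothetical, and punctuation is still ignored, as in the question:

```python
def mark_repeats(text, window=40):
    """Wrap a word in asterisks if it already appeared among the previous `window` words."""
    recent = []   # the most recent words, at most `window` of them
    output = []
    for word in text.split():
        if word in recent:
            output.append('*' + word + '*')
        else:
            output.append(word)
        recent.append(word)
        if len(recent) > window:
            del recent[0]  # drop the oldest word once the window is full
    return ' '.join(output)


# Hypothetical usage with the file from the question:
# with open('story.txt') as infil:
#     print(mark_repeats(infil.read()))

print(mark_repeats("a b a c a"))  # prints: a b *a* c *a*
```

Reading the whole file with `read()` gives a single string, so `split()` can be called on it directly; `readlines()` returns a list of lines, which has no `split()` method.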
Nov 3 '10 #4
