How to split a text to words and then filter it in Python?

Hi guys,
I have a question and would love some ideas on how to get started with it.

First of all, I'm using Windows 7 and Python 2.7.

I've got a text file, and I'm trying to read the text from that file and then check every word against the 40 words around it, to make sure the word in question does not appear more than once.

In other words, I want to first split the text into words, put them in a list, and then check [0] against [1] all the way to [39]. Then I want to check [1] against the words up to [40], then [2] against the words up to [41], etc.

Splitting the words is not that hard, I think; I just need to split at every space and every dot. What I am not sure how to do is check the words against the other words in the text.
Any ideas on how that could be done? =)
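
For the splitting step, here is a minimal sketch, assuming spaces (any whitespace) and dots are the only separators, as described above. The function name `split_words` is made up for illustration; it is not from the thread.

```python
import re

def split_words(text):
    """Split text at every run of whitespace and/or dots, dropping empty pieces."""
    # [\s.]+ matches one or more whitespace characters or full stops.
    return [w for w in re.split(r"[\s.]+", text) if w]

print(split_words("The cat sat. The dog sat too."))
# prints ['The', 'cat', 'sat', 'The', 'dog', 'sat', 'too']
```

Other punctuation (commas, semicolons) stays attached to the words in this sketch; it would need to be added to the character class or stripped separately.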
Nov 1 '10 #1
Yes, I have an idea on how it could be done. Split the text into a list of words, convert to lower case and strip any punctuation. Iterate on the list and create a sublist by slicing the list as in words[lowIdx:highIdx]. Adjust the low and high indices as required when near the start and end of the word list. Pop the current word from the sublist. Iterate on the remaining members of the sublist to compare to the current word.
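
The steps above might be sketched roughly like this. It is only one possible reading: the name `repeated_in_window` and the `radius` parameter (how many words to look at on each side) are assumptions, not something fixed by the question.

```python
import string

def repeated_in_window(words, radius=40):
    """Return the indices of words that also occur within `radius` words on either side."""
    # Normalise: lower case and strip surrounding punctuation.
    cleaned = [w.lower().strip(string.punctuation) for w in words]
    repeats = []
    for i, word in enumerate(cleaned):
        # Clamp the slice indices near the start and end of the list.
        low = max(0, i - radius)
        high = min(len(cleaned), i + radius + 1)
        window = cleaned[low:high]
        window.pop(i - low)  # remove the current word from the sublist
        # `in` compares the current word with each remaining member.
        if word and word in window:
            repeats.append(i)
    return repeats

print(repeated_in_window("the cat and the dog".split(), radius=3))
# prints [0, 3]
```

Slicing a fresh sublist for every word is simple but does repeated work; for a long text, a sliding structure that is updated one word at a time would likely be cheaper.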

The best way to learn how to program in Python is to write programs. Try writing the code and post back with your questions.
Nov 1 '10 #2
Hi again,
I'll explain further what I want to do. I want to read in a text file and check whether a word has already appeared in the last 40 words. In other words, I want to flag such words by wrapping them in asterisks, like this: *RandomWordThatAppearedMoreThanOnceInTheLast40Words*.

Here is the code I've been working on so far. Currently I'm ignoring all the dots, semicolons etc. I just want to get the basics done.

  infil = open ('story.txt')
  line = infil.readlines()
  wordlist = list()
  allTheWords = line.split()

  if string in dictionary:
      dictionary(string) += 1
      else:
          dictionary(string) = 1

  if len(wordlist) > 40:
      del wordlist[0]

  finishedText = (' ').join(allTheWords)
Nov 2 '10 #3
You should print "line", "allTheWords", and "wordlist" to see if they contain what you think they do. Also, the indentation for the if and else is incorrect. You can get the last 40 lines with
See Section 14.5 here for an example of reading a file, and then substitute the name of the file you wish to read.
Some info on lists http://www.greenteapress.com/thinkpy...l/book011.html
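
For anyone finding this thread later, here is one way the "last 40 words" filter from post #3 could be written. It is a sketch, not the original poster's final code: the function name `mark_repeats` is hypothetical, using `collections.deque` instead of a list with `del wordlist[0]` is a swapped-in choice, and comparing words case-insensitively is an assumption taken from the suggestion in post #2. Punctuation is ignored, as in post #3.

```python
from collections import deque

def mark_repeats(text, window=40):
    """Wrap in asterisks every word already seen among the previous `window` words."""
    recent = deque(maxlen=window)  # the oldest word drops off automatically
    out = []
    for word in text.split():
        key = word.lower()  # assumption: "The" and "the" count as the same word
        out.append("*%s*" % word if key in recent else word)
        recent.append(key)
    return " ".join(out)

print(mark_repeats("the cat saw the other cat", window=3))
# prints: the cat saw *the* other cat
```

With the file from the question, this could be called as `mark_repeats(open('story.txt').read())`. Note that `read()` returns the whole file as one string, whereas `readlines()` returns a list of lines, and a list has no `split()` method.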
Nov 3 '10 #4