Bytes | Software Development & Data Engineering Community

How to split a text to words and then filter it in Python?

Hi guys,
I have a question and would love some ideas on how to get started.

First of all, I'm using Windows 7 and Python 2.7.

I've got a text file and I'm trying to read the text from that file and then check every word against the 40 words around it, to make sure the word in question has not been repeated more than once.

In other words, I want to first split the text into words, put them in a list and then check [0] against [1] all the way to [39]. Then I want to check [1] against [40], then check [2] against [41] etc.

Splitting the words is not that hard, I think; I just need to split at every space and every dot. What I am not sure how to do is check the words against the other words in the text.
Any ideas, guys, on how that could be done? =)
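
For the splitting step described above (split at every space and every dot), a minimal sketch using the standard `re` module might look like this. The function name `split_words` and the sample sentence are made up for illustration; the code should run on Python 2.7 as well as Python 3.

```python
import re

def split_words(text):
    """Split text on runs of whitespace and periods, dropping empty pieces."""
    # re.split leaves an empty string when the text ends with a delimiter,
    # so empty pieces are filtered out.
    return [w for w in re.split(r"[\s.]+", text) if w]

sample = "The cat sat. The dog sat too."   # hypothetical sample text
print(split_words(sample))
```

Other punctuation (commas, semicolons and so on) is left attached to the words here; adding those characters to the character class would split on them too.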
Nov 1 '10 #1
bvdet
2,851 Recognized Expert Moderator Specialist
Yes, I have an idea on how it could be done. Split the text into a list of words, convert to lower case and strip any punctuation. Iterate on the list and create a sublist by slicing the list as in words[lowIdx:highIdx]. Adjust the low and high indices as required when near the start and end of the word list. Pop the current word from the sublist. Iterate on the remaining members of the sublist to compare to the current word.
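
The steps above could be sketched roughly as follows. This is only one reading of the description: the names `normalize` and `repeated_nearby` are hypothetical, and `radius` (how many words to look at on each side) is an assumption about what "40 words around" means.

```python
import string

def normalize(word):
    """Lower-case a word and strip punctuation from both ends."""
    return word.lower().strip(string.punctuation)

def repeated_nearby(words, radius=40):
    """Return the indices of words that also occur within `radius` positions."""
    cleaned = [normalize(w) for w in words]
    hits = []
    for i, current in enumerate(cleaned):
        low = max(0, i - radius)                    # clamp near the start
        high = min(len(cleaned), i + radius + 1)    # clamp near the end
        sublist = cleaned[low:high]
        sublist.pop(i - low)                        # remove the current word itself
        if current and current in sublist:          # compare against the rest
            hits.append(i)
    return hits

words = "The cat saw the dog. A bird sang.".split()   # hypothetical sample
print(repeated_nearby(words, radius=3))
```

With this sample and `radius=3`, the two occurrences of "the" (indices 0 and 3) are reported, since they fall inside each other's window after lower-casing.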

The best way to learn how to program in Python is to write programs. Try writing the code and post back with your questions.
Nov 1 '10 #2
Hi again
I'll explain further what I want to do. I want to read in a text file and check whether a word appears more than once in the last 40 words. In other words, I want to filter out such words by marking them like this: *RandomWordThatAppearedMoreThanOnceInTheLast40Words*.

Here is the code I've been working on so far. Currently I'm ignoring all the dots, semicolons etc. I just want to get the basics done.


infil = open ('story.txt')

line = infil.readlines()

wordlist = list()

allTheWords = line.split()


if string in dictionary:
    dictionary(string) += 1
    else:
        dictionary(string) = 1


if len(wordlist) > 40:
    del wordlist[0]


finishedText = (' ').join(allTheWords)

Nov 2 '10 #3
dwblas
626 Recognized Expert Contributor
You should print "line", "allTheWords", and "wordlist" to see if they contain what you think they do. Also, the indentation for the if and else is incorrect. You can get the last 40 words with
wordlist[-40:]
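
As a quick illustration of list slicing, here is a small sketch showing how the first 40 and the last 40 items of a list are selected. The contents of `wordlist` are made-up sample data.

```python
wordlist = ["w%d" % i for i in range(100)]   # hypothetical data: w0 .. w99

first_40 = wordlist[0:40]    # items 0..39, same as wordlist[:40]
last_40 = wordlist[-40:]     # the 40 items at the end of the list

print("%s .. %s" % (first_40[0], first_40[-1]))
print("%s .. %s" % (last_40[0], last_40[-1]))
```

If the list holds fewer than 40 items, both slices simply return the whole list rather than raising an error.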
See Section 14.5 here for an example of reading a file, and then substitute the name of the file you wish to read.
Some info on lists http://www.greenteapress.com/thinkpy...l/book011.html
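
Putting the pieces together, one possible way to get the behaviour described in post #3 is sketched below. It is not the only approach. It assumes that "repeated" means the word already occurred among the previous 40 words, it ignores punctuation as the original poster does, and the names `mark_repeats` and `mark_file` are hypothetical. Note that `readlines()` returns a list, which has no `split()` method, so the sketch uses `read()` to get one string, and a `collections.deque` with `maxlen` replaces the manual `del wordlist[0]`.

```python
from collections import deque

def mark_repeats(text, window=40):
    """Wrap a word in asterisks if it occurred among the previous `window` words."""
    recent = deque(maxlen=window)   # the oldest word drops off automatically
    out = []
    for word in text.split():
        if word in recent:
            out.append("*" + word + "*")   # repeated within the window
        else:
            out.append(word)
        recent.append(word)
    return " ".join(out)

def mark_file(path, window=40):
    """Read the whole file as one string and mark its repeated words."""
    with open(path) as infil:
        return mark_repeats(infil.read(), window)

print(mark_repeats("a b a c a", window=2))
```

For a real file, `mark_file('story.txt')` would return the marked text, assuming `story.txt` exists in the working directory.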
Nov 3 '10 #4
