473,698 Members | 2,445 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Building a word list from multiple files

Hi,

Here's what i want to accomplish.
I want to make a list of frequenctly occuring words in a group of
files along with the no of occurances of each
The brute force method will be to read the file as a string,split,lo ad
the words
into a dict with words as key and no of occurances as key.
Load the next file ,iterate through the new words increment the value
if there is
a match or add a new key,value pair if there is none.
repeat for all files.

is there a better way ??
Thanks in advance.
Manu
Jul 18 '05 #1
5 1695
Manu wrote:
Hi,

Here's what i want to accomplish.
I want to make a list of frequenctly occuring words in a group of
files along with the no of occurances of each
The brute force method will be to read the file as a string,split,lo ad
the words
into a dict with words as key and no of occurances as key.
Load the next file ,iterate through the new words increment the value
if there is
a match or add a new key,value pair if there is none.
repeat for all files.

is there a better way ??
Thanks in advance.
Manu


Manu,

There are some things we would need to know to specifically
answer your question. I've tried to answer it with some
"assumption s" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you
add/change one of the files run this process to recreate
and shelve the new dictionary. In your main program
get the shelved dictionary from the preprocess program
so that you don't have to process all the files every
time.

Hope info helps,
Larry Bates
Syscon, Inc.
Jul 18 '05 #2
Larry Bates wrote:
2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.


If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/pyt...er/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationaliz ation, it could be simpler.)

STeve
Jul 18 '05 #3
hi,
1) How large are the files you are reading (e.g. can they
fit in memory)?
The files are email messages.
I will using the the builtin email module to extract only the content
type which is plain text or in html.So no line by line processing is
possible unless
i write my own parser for email.
2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).
in the case of html mail i only extract the text and strip of the
tags.
Since this is regular text i expect no special seperators and as i
understand split() by default takes any whitespace character as
delimter.This will work fine for my purposes.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you


This is what i was planning to do.Once the processing is done for a
set of files they are never processed again.I was going to store the
dict as a string in a file and then use eval() to get it back.
Thanks
Manu
Jul 18 '05 #4
Manu wrote:
hi,

1) How large are the files you are reading (e.g. can they
fit in memory)?


The files are email messages.
I will using the the builtin email module to extract only the content
type which is plain text or in html.So no line by line processing is
possible unless
i write my own parser for email.


The email package can do that parsing for you -- it's not too difficult
to feed it a raw message file and get back only the text and/or html
payload.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you


This is what i was planning to do.Once the processing is done for a
set of files they are never processed again.I was going to store the
dict as a string in a file and then use eval() to get it back.


Use the shelve module instead of eval()ing it yourself -- the shelve
authors have already done all of the hard work for you. It'll act
almost like a regular dictionary, but is extremely easy to save to disk
and reload later.

This is why Python is called "batteries included". :)

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #5
With email messages they should be small enough so reading
them into memory isn't an issue so line-by-line processing
isn't indicated here.

Email messages have LOTS of punctuation in the other than
witespace between words. Just look at your email message
below. It contains:
greater than symbol ) parenthesis
.. periods
? question marks
, commas

Even text like: "html.So no line.." Periods with no
whitespace will be a problem string split would
return "html.So" as a word.

I really think you are going to need to use regex to
split this into "words" and even then the words may
be of questionable origin. See another response for
an example regex expression that might work. Constructs
like e.g. will return two words "e" and "g" (which
might be ok for your application).

Hope feedback at least helps.

Larry Bates
Manu wrote: hi,
1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will using the the builtin email module to extract only the content
type which is plain text or in html.So no line by line processing is
possible unless
i write my own parser for email.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

in the case of html mail i only extract the text and strip of the
tags.
Since this is regular text i expect no special seperators and as i
understand split() by default takes any whitespace character as
delimter.This will work fine for my purposes.
If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what i was planning to do.Once the processing is done for a
set of files they are never processed again.I was going to store the
dict as a string in a file and then use eval() to get it back.
Thanks
Manu

Jul 18 '05 #6

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1734
by: PatchFactory Support | last post by:
Description: Professional and easy-to-use patch building environment that can help you to create instant patch packages for software and file updating. Generated patch packages are small size self-extracting executable update programs in a famous installer style with adjustable user-friendly interface and multilingual support. Enhanced with features like easy-to-use interface including a Wizard mode, powerful patch engine, integrated...
7
2087
by: MLH | last post by:
Building Applications with Microsoft Access 97 is a publication I think I need. Is it available in book form? Is MicroSoft the sole vendor? Anybody got a copy they wanna sell???
3
4887
by: mphanke | last post by:
Hi, I'm writting an application based on an SQL Server for Order Management. I have some data I want to export to Excel and Word, maybe some day I will implement a serial letter. The problem is I don't have Office installed on my Computer. Is there something I can use for this task on the MSDN-DVDs? Where can I find a couple tutorials on dealing with things like that on the web???
8
7517
by: Frost | last post by:
Hi All, I am a newbie i have written a c program on unix for line by line comparison for two files now could some one help on how i could do word by word comparison in case both lines have the same words but in jumbled order they should match and print only the dissimilar lines.The program also checks for multiple entries of the same line. Here file 2 converts to file 3 which is in the format of file1 and i compare file1 with file3.
12
10670
by: Dixie | last post by:
Is there a way to shell to Microsoft Word from Access and load a specific template - using VBA? dixie
6
2651
by: Bob Alston | last post by:
Looking for someone with experience building apps with multiple instances of forms open. I am building an app for a nonprofit organizations case workers. They provide services to the elderly. so far I have built a traditional app, switchboard, forms, etc. Part of this app is to automate the forms they previously prepared manually. After the app was built and works just fine, I find out there are several case managers using MS word...
0
1919
by: alivip | last post by:
I write code to get most frequent words in the file I won't to implement bigram probability by modifying the code to do the following: How can I get every Token (word) and PreviousToken(Previous word) and frequency and probability From text file and put each one in cell in table For example if the text file content is "Every man has a price. Every woman has a price." First Token(word) is "Every" PreviousToken(Previous...
5
2538
by: alivip | last post by:
How can I get every Token (word) and PreviousToken(Previous word) From multube files and frequency of each two word my code is trying to get all single word and double word (every Token (word) and PreviousToken(Previous word)) from multube files and get frequency of both. it can get for single word but double word give error line 50, in most_frequant_word word1+= ' ' + word_list IndexError: list index out of range import...
7
7152
Curtis Rutland
by: Curtis Rutland | last post by:
Building A Silverlight (2.0) Multi-File Uploader All source code is C#. VB.NET source is coming soon. Note: This project requires Visual Studio 2008 SP1 or Visual Web Developer 2008 SP1 and Silverlight 2.0. To get these tools please visit this page Get Started : The Official Microsoft Silverlight Site and follow Step 1. Occasionally you find the need to have users upload multiple files at once. You could use multiple FileUpload...
0
8675
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...
0
8604
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can effortlessly switch the default language on Windows 10 without reinstalling. I'll walk you through it. First, let's disable language synchronization. With a Microsoft account, language settings sync across devices. To prevent any complications,...
0
9029
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...
1
8897
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows Update option using the Control Panel or Settings app; it automatically checks for updates and installs any it finds, whether you like it or not. For most users, this new feature is actually very convenient. If you want to control the update process,...
0
8862
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each protocol has its own unique characteristics and advantages, but as a user who is planning to build a smart home system, I am a bit confused by the choice of these technologies. I'm particularly interested in Zigbee because I've heard it does some...
0
4619
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
1
3050
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated we have to send another system
2
2331
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
3
2002
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.