Building a word list from multiple files

Hi,

Here's what I want to accomplish.
I want to make a list of frequently occurring words in a group of
files, along with the number of occurrences of each.
The brute-force method would be to read the file as a string, split it,
and load the words
into a dict with the words as keys and the number of occurrences as values.
Then load the next file, iterate through the new words, and increment the
value if there is
a match, or add a new key/value pair if there is none.
Repeat for all files.

Is there a better way?
Thanks in advance.
Manu
Jul 18 '05 #1
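
For reference, the brute-force approach described in the post above might look roughly like this in Python. It is only a sketch: the function names count_words and most_frequent are made up for illustration, and the files are assumed to be text that can be decoded as UTF-8 (undecodable bytes are replaced rather than raising an error).

```python
def count_words(paths):
    """Count whitespace-separated words across all files in `paths`."""
    counts = {}
    for path in paths:
        # Read the whole file as one string and split on whitespace,
        # as described in the post. UTF-8 is an assumption.
        with open(path, encoding="utf-8", errors="replace") as f:
            words = f.read().split()
        for word in words:
            # Increment the value on a match, or start a new key at 1.
            counts[word] = counts.get(word, 0) + 1
    return counts


def most_frequent(counts, n=10):
    """Return the n (word, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]
```

The standard library's collections.Counter wraps the same idea (update() to add words, most_common() to rank them); the plain dict is used here only because it matches the description in the post.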
Manu,

There are some things we would need to know to specifically
answer your question. I've tried to answer it with some
"assumptions" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you
add/change one of the files run this process to recreate
and shelve the new dictionary. In your main program
get the shelved dictionary from the preprocess program
so that you don't have to process all the files every
time.

Hope this info helps,
Larry Bates
Syscon, Inc.
Jul 18 '05 #2
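
If the files were too large to read in one go (point 1 above), a line-at-a-time version could look like the sketch below. It assumes UTF-8 text and plain whitespace splitting, and count_words_streaming is a name invented here for illustration.

```python
from collections import Counter


def count_words_streaming(paths):
    """Count words across files while holding only one line in memory."""
    counts = Counter()
    for path in paths:
        # UTF-8 is an assumption; undecodable bytes are replaced.
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                # Counter.update() adds one to the count of each word.
                counts.update(line.split())
    return counts
```

Calling count_words_streaming(paths).most_common(20) would then give the twenty most frequent words with their counts.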
Larry Bates wrote:
> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).
>
> If not, you will probably need to use regular expressions
> to handle all different punctuations that might separate
> the words. Things like quotes, commas, periods, colons,
> semi-colons, etc. Simple string split won't handle these
> properly.


If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/pyt...er/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationalization, it could be simpler.)

STeve
Jul 18 '05 #3
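
A minimal sketch of how the regexp mentioned above could feed a word count. The names WORD_RE and count_words_in_text are hypothetical, and lowercasing the words is an extra assumption that the thread does not ask for.

```python
import re
from collections import Counter

# One or more characters that are word characters but neither digits
# nor underscores, i.e. runs of letters. In Python 3, str patterns are
# Unicode-aware by default, so accented letters match too.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words_in_text(text):
    """Count the letter-only words in a string, ignoring case."""
    return Counter(word.lower() for word in WORD_RE.findall(text))
```

If internationalization is not a concern, a pattern such as r"[A-Za-z]+" is one possible simpler form, though that is a guess at what the poster had in mind.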
Hi,

> 1) How large are the files you are reading (e.g. can they
> fit in memory)?

The files are email messages.
I will be using the builtin email module to extract only the content
type which is plain text or in html.So no line by line processing is
possible unless
I write my own parser for email.

> 2) Are the words in the file separated with some consistent
> character (e.g. space, tab, csv, etc).

In the case of html mail I only extract the text and strip off the
tags.
Since this is regular text I expect no special separators and as I
understand split() by default takes any whitespace character as
delimiter.This will work fine for my purposes.

> If not, preprocess the files and use shelve to save a
> dictionary that has already been processed. When you

This is what I was planning to do.Once the processing is done for a
set of files they are never processed again.I was going to store the
dict as a string in a file and then use eval() to get it back.
Thanks
Manu
Jul 18 '05 #4
Manu wrote:
> Hi,
>
>> 1) How large are the files you are reading (e.g. can they
>> fit in memory)?
>
> The files are email messages.
> I will be using the builtin email module to extract only the content
> type which is plain text or in html.So no line by line processing is
> possible unless
> I write my own parser for email.


The email package can do that parsing for you -- it's not too difficult
to feed it a raw message file and get back only the text and/or html
payload.
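
As a rough sketch of what that can look like with the standard email package: the helper name extract_text_parts is invented here, and the use of policy.default (which makes get_content() available and decodes text parts to str) is an assumption about how one might set it up.

```python
import email
from email import policy


def extract_text_parts(raw_bytes):
    """Return (content_type, text) pairs for the text/plain and
    text/html parts of a raw message given as bytes."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    # walk() visits the message itself and every nested part.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no text of their own
        if part.get_content_type() in ("text/plain", "text/html"):
            # With policy.default, get_content() returns decoded str
            # for text parts.
            parts.append((part.get_content_type(), part.get_content()))
    return parts
```

The text/html payloads still contain their tags at this point, so stripping them remains a separate step.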

>> If not, preprocess the files and use shelve to save a
>> dictionary that has already been processed. When you
>
> This is what I was planning to do.Once the processing is done for a
> set of files they are never processed again.I was going to store the
> dict as a string in a file and then use eval() to get it back.


Use the shelve module instead of eval()ing it yourself -- the shelve
authors have already done all of the hard work for you. It'll act
almost like a regular dictionary, but is extremely easy to save to disk
and reload later.

This is why Python is called "batteries included". :)
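
A minimal sketch of the shelve approach, assuming the counts for each newly processed batch of files arrive as an ordinary dict. The function names and the shelf path are hypothetical.

```python
import shelve


def update_shelved_counts(shelf_path, new_counts):
    """Add the counts from a newly processed batch to the shelf on disk."""
    # shelve.open() creates the file if it does not exist yet.
    with shelve.open(shelf_path) as shelf:
        for word, n in new_counts.items():
            # Shelf keys must be strings; values are pickled.
            shelf[word] = shelf.get(word, 0) + n


def load_counts(shelf_path):
    """Read all stored counts back into a plain dict."""
    with shelve.open(shelf_path) as shelf:
        return dict(shelf)
```

Because shelve stores values with pickle, a shelf file should only be opened if it comes from a trusted source, much as eval() should only be run on trusted text.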

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #5
With email messages, the files should be small enough that reading
them into memory isn't an issue, so line-by-line processing
isn't indicated here.

Email messages have LOTS of punctuation, other than
whitespace, between words. Just look at your email message
below. It contains:
  >  greater-than symbols
  )  parentheses
  .  periods
  ?  question marks
  ,  commas

Even in text like "html.So no line..", periods with no
whitespace will be a problem: string split would
return "html.So" as a word.

I really think you are going to need to use regex to
split this into "words", and even then the words may
be of questionable origin. See another response for
an example regular expression that might work. Constructs
like "e.g." will return two words, "e" and "g" (which
might be ok for your application).
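
To make the difference concrete, here is a small comparison of the two approaches on the examples above. It uses the regexp quoted earlier in the thread; the variable names are only illustrative.

```python
import re

# Runs of letters: word characters minus digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")

text = "html.So no line"

# Whitespace splitting keeps the period glued to its neighbours.
split_words = text.split()
print(split_words)   # ['html.So', 'no', 'line']

# The regexp treats the period as a separator.
regex_words = WORD_RE.findall(text)
print(regex_words)   # ['html', 'So', 'no', 'line']

# Abbreviations fall apart into single letters.
abbrev_words = WORD_RE.findall("e.g.")
print(abbrev_words)  # ['e', 'g']
```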

Hope this feedback at least helps.

Larry Bates

Manu wrote:
> Hi,
>
>> 1) How large are the files you are reading (e.g. can they
>> fit in memory)?
>
> The files are email messages.
> I will be using the builtin email module to extract only the content
> type which is plain text or in html.So no line by line processing is
> possible unless
> I write my own parser for email.
>
>> 2) Are the words in the file separated with some consistent
>> character (e.g. space, tab, csv, etc).
>
> In the case of html mail I only extract the text and strip off the
> tags.
> Since this is regular text I expect no special separators and as I
> understand split() by default takes any whitespace character as
> delimiter.This will work fine for my purposes.
>
>> If not, preprocess the files and use shelve to save a
>> dictionary that has already been processed. When you
>
> This is what I was planning to do.Once the processing is done for a
> set of files they are never processed again.I was going to store the
> dict as a string in a file and then use eval() to get it back.
> Thanks
> Manu

Jul 18 '05 #6
