Building a word list from multiple files

Manu

Hi,

Here's what I want to accomplish.
I want to make a list of frequently occurring words in a group of
files, along with the number of occurrences of each.
The brute-force method would be to read the file as a string, split it,
and load the words into a dict with the words as keys and the number of
occurrences as values.
Then load the next file and iterate through the new words, incrementing
the value if there is a match or adding a new key/value pair if there is
none.
Repeat for all files.
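
A minimal sketch of the brute-force approach just described, assuming the files are text that fits in memory and that whitespace splitting is good enough. The function name `count_words` and the use of `collections.Counter` (a dict subclass) are illustrative choices, not something from the thread:

```python
from collections import Counter


def count_words(paths):
    """Count whitespace-separated words across all files in `paths`."""
    counts = Counter()
    for path in paths:
        # errors="replace" is an assumption: it keeps odd bytes from
        # aborting the whole run.
        with open(path, encoding="utf-8", errors="replace") as f:
            # split() with no arguments splits on any run of whitespace;
            # update() increments existing keys and adds new ones.
            counts.update(f.read().split())
    return counts
```

Calling `count_words(list_of_paths).most_common(20)` would then give the 20 most frequent words with their counts.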

Is there a better way?
Thanks in advance.
Manu

Jul 18 '05 #1

Larry Bates

Manu,

There are some things we would need to know to specifically
answer your question. I've tried to answer it with some
"assumptions" about your data/usage:

1) How large are the files you are reading (e.g. can they
fit in memory)?

If not, you will need to read the file a line at a time
and process each line individually.

2) Are the words in the file separated by some consistent
character (e.g. space, tab, comma)?

If not, you will probably need to use regular expressions
to handle all the different punctuation that might separate
the words: quotes, commas, periods, colons, semicolons,
etc. A simple string split won't handle these properly.

3) Do the "files" change a lot?

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you
add/change one of the files run this process to recreate
and shelve the new dictionary. In your main program
get the shelved dictionary from the preprocess program
so that you don't have to process all the files every
time.
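
For point 1, reading a line at a time could look something like the sketch below. It assumes plain text files and whitespace-separated words; the name `count_words_by_line` is made up for illustration:

```python
from collections import Counter


def count_words_by_line(path, counts=None):
    """Add the words in `path` to `counts`, reading one line at a time."""
    if counts is None:
        counts = Counter()
    with open(path, encoding="utf-8", errors="replace") as f:
        # Iterating over the file object yields one line at a time,
        # so the whole file never has to be held in memory.
        for line in f:
            counts.update(line.split())
    return counts
```

Passing the same `counts` object back in for each file accumulates the totals across the whole group.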

Hope this info helps,
Larry Bates
Syscon, Inc.

Jul 18 '05 #2

Steven Bethard

Larry Bates wrote:

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

If not, you will probably need to use regular expressions
to handle all different punctuations that might separate
the words. Things like quotes, commas, periods, colons,
semi-colons, etc. Simple string split won't handle these
properly.

If you go this way, you probably ought to read this thread:

http://mail.python.org/pipermail/pyt...er/250520.html

which suggests finding words with a regexp something like r'[^\W\d_]+'.
(If you're not concerned about internationalization, it could be simpler.)
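
As a rough sketch of how that pattern could be used for counting (lower-casing the words is my own assumption, and `count_words_regex` is a made-up name):

```python
import re
from collections import Counter

# [^\W\d_] means "a word character that is not a digit or underscore",
# i.e. a letter. In Python 3, str patterns are Unicode-aware by default.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words_regex(text):
    """Count runs of letters in `text`, ignoring case."""
    return Counter(word.lower() for word in WORD_RE.findall(text))
```

Digits, underscores, and punctuation all act as separators here, so "x_1" contributes only "x".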

STeve

Jul 18 '05 #3

Manu

Hi,

1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the content
whose type is plain text or HTML. So no line-by-line processing is
possible unless I write my own parser for email.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

In the case of HTML mail I only extract the text and strip off the
tags.
Since this is regular text, I expect no special separators, and as I
understand it, split() by default takes any whitespace character as a
delimiter. This will work fine for my purposes.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.
Thanks
Manu

Jul 18 '05 #4

Jeff Shannon

Manu wrote:

hi,

1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the content
whose type is plain text or HTML. So no line-by-line processing is
possible unless I write my own parser for email.

The email package can do that parsing for you -- it's not too difficult
to feed it a raw message file and get back only the text and/or html
payload.
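
Something along these lines might do it. This is only a sketch using the modern `email` API with `policy.default`; the helper name `text_parts` is invented, and it makes no attempt to skip text attachments or strip HTML tags:

```python
import email
from email import policy


def text_parts(raw_bytes):
    """Return the decoded text/plain and text/html bodies of a raw message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    parts = []
    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() in ("text/plain", "text/html"):
            # With policy.default, get_content() returns a decoded str
            # for text parts.
            parts.append(part.get_content())
    return parts
```

For a message saved on disk, `text_parts(open(path, "rb").read())` would return the list of text bodies, ready to be split into words.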

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.

Use the shelve module instead of eval()ing it yourself -- the shelve
authors have already done all of the hard work for you. It'll act
almost like a regular dictionary, but is extremely easy to save to disk
and reload later.
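
A small sketch of what that could look like, assuming the counts live in a dict or `Counter`. The key name "word_counts" and both function names are made up for the example:

```python
import shelve
from collections import Counter


def save_counts(db_path, counts):
    """Store the word counts in a shelf file at `db_path`."""
    with shelve.open(db_path) as db:
        # Stored as a plain dict so the shelf holds only built-in types.
        db["word_counts"] = dict(counts)


def load_counts(db_path):
    """Load previously saved counts, or an empty Counter if none exist."""
    with shelve.open(db_path) as db:
        return Counter(db.get("word_counts", {}))
```

One caveat: shelve is built on pickle, so, much like eval(), it should only be pointed at files you created yourself.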

This is why Python is called "batteries included". :)

Jeff Shannon
Technician/Programmer
Credit International

Jul 18 '05 #5

Larry Bates

Email messages should be small enough that reading them
into memory isn't an issue, so line-by-line processing
isn't needed here.

Email messages have LOTS of punctuation other than
whitespace between words. Just look at your email message
below. It contains:

> greater-than symbols
) parentheses
. periods
? question marks
, commas

Even text like "html.So no line.." is a problem: with a
period and no whitespace, string split would return
"html.So" as a word.

I really think you are going to need to use a regex to
split this into "words", and even then the words may
be of questionable origin. See another response for
an example regular expression that might work. Constructs
like "e.g." will return two words, "e" and "g" (which
might be OK for your application).
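
To make the difference concrete, here is a quick comparison on a made-up sample sentence, using the pattern r'[^\W\d_]+' suggested elsewhere in this thread:

```python
import re

# Sample text invented for illustration; it mimics the punctuation above.
text = "plain text or in html.So no line by line, e.g. this"

# Whitespace split keeps punctuation glued to the words.
by_split = text.split()

# The regex keeps only runs of letters, so punctuation separates words.
by_regex = re.findall(r"[^\W\d_]+", text)

print(by_split)
print(by_regex)
```

The split version contains "html.So", "line," and "e.g." as words, while the regex version yields "html", "So", "line", "e" and "g" instead.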

Hope this feedback at least helps.

Larry Bates
Manu wrote:

Hi,

1) How large are the files you are reading (e.g. can they
fit in memory)?

The files are email messages.
I will be using the built-in email module to extract only the content
whose type is plain text or HTML. So no line-by-line processing is
possible unless I write my own parser for email.

2) Are the words in the file separated with some consistent
character (e.g. space, tab, csv, etc).

In the case of HTML mail I only extract the text and strip off the
tags.
Since this is regular text, I expect no special separators, and as I
understand it, split() by default takes any whitespace character as a
delimiter. This will work fine for my purposes.

If not, preprocess the files and use shelve to save a
dictionary that has already been processed. When you

This is what I was planning to do. Once the processing is done for a
set of files, they are never processed again. I was going to store the
dict as a string in a file and then use eval() to get it back.
Thanks
Manu

Jul 18 '05 #6
