
Looking for lots of words in lots of files

Just wondering if anyone has ever solved this efficiently... not looking
for specific solutions tho... just ideas.

I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading a
file once I find 10 of the words in it. It's easy for me to do this with
a few dozen words, but a thousand words is too large for an RE and too
inefficient to loop, etc. Any suggestions?

Thanks
Jun 27 '08 #1
brad wrote:
<cut>
Use an indexer, like Lucene (available as PyLucene), or a database that
offers word indices.
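For illustration, a toy in-memory word index along those lines; a real
indexer such as Lucene persists its index and does far more, and the
paths here are invented:

import glob
from collections import defaultdict

# Map each word to the set of files that contain it, in one pass
# over all the files.
index = defaultdict(set)
for name in glob.glob("files/*.txt"):
    for token in open(name).read().split():
        index[token].add(name)

# Each query word then costs a single dictionary lookup:
print(index.get("example", set()))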

Diez
Jun 27 '08 #2
Upload, wait, and google them.

Seriously tho, aside from using a real indexer, I would build a set
of the words I'm looking for, then loop over each file, splitting its
text into words and checking each one for membership in the set. Each
match gets added to a dict mapping file names to lists of words found,
and you stop reading a file once its list hits length 10. I don't think
that would be a complicated solution, and performance shouldn't be
terrible.
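A minimal sketch of that set-based scan (the word-list path, file glob,
and per-file 10-word cutoff are illustrative assumptions, not from the
thread):

import glob

# Build a set of the target words for O(1) membership tests;
# assumes one word per line in wordlist.txt.
words = set(line.strip() for line in open("wordlist.txt"))

found = {}  # file name -> words found in that file
for name in glob.glob("files/*.txt"):
    hits = set()
    with open(name) as f:
        for line in f:
            for token in line.split():
                if token in words:
                    hits.add(token)
            if len(hits) >= 10:  # stop reading this file early
                break
    found[name] = sorted(hits)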

If you need to run this more than once, use an indexer.

If you only need to use it once, use an indexer, so you learn how for
next time.

On Jun 18, 2008, at 10:28 AM, brad wrote:
<cut>
Jun 27 '08 #3
Calvin Spealman wrote:
Upload, wait, and google them.

<cut>
If you can't use an indexer, and performance matters, evaluate using
grep and a shell script. Seriously.

grep is a couple of orders of magnitude faster at pattern matching
strings in files (and especially regexps) than Python is. Even if you
are invoking grep multiple times it is still likely to be faster than a
"maximally efficient" single pass over the file in Python. This
realization was disappointing to me :)

Kris
Jun 27 '08 #4
brad wrote:
<cut>
The quick answer would be:
grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
where WORDLIST is a file containing the thousand words, one per line.

The more interesting answers would be to use either a suffix tree or an
Aho-Corasick graph.

- The suffix tree is a representation of the target string (your files)
that lets you search quickly for a word. Your problem would then be
solved by 1) building a suffix tree for your files, and 2) searching for
each word sequentially in the suffix tree.

- The Aho-Corasick graph is a representation of the query word list that
allows fast scanning for the words in a target string. Your problem would
then be solved by 1) building an Aho-Corasick graph for the list of
words, and 2) scanning each file sequentially.

The preference for one or the other depends on some details of your
problem: the expected size of the target files, the rate of overlap
between the words in your list (are there common prefixes?), whether you
will repeat the operation with another word list or another set of files,
etc. Personally, I'd lean towards Aho-Corasick; that is partly a matter
of taste, but the kinds of applications that come to my mind make it
more practical.

Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also, you can
find modules for building both data structures in the Python Package
Index.
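As one concrete example (a sketch with invented file names and an
assumed 10-word cutoff), the pyahocorasick module from PyPI builds such
a graph:

import ahocorasick  # the pyahocorasick package

# Build the Aho-Corasick automaton once from the word list.
automaton = ahocorasick.Automaton()
for word in open("wordlist.txt").read().split():
    automaton.add_word(word, word)
automaton.make_automaton()

# Scan one file in a single pass; iter() yields (end_index, value)
# for every occurrence of any listed word.
found = set()
for end_index, word in automaton.iter(open("some_file.txt").read()):
    found.add(word)
    if len(found) >= 10:  # stop after 10 distinct words
        break
print(found)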

Cheers,
RB
Jun 27 '08 #5
I forgot to mention another way: put one thousand monkeys to work on it. ;)

RB

Robert Bossy wrote:
<cut>
Jun 27 '08 #6
Kris Kennaway wrote:
<cut>
Adding to this: there is nothing wrong with wrapping grep from Python
and reverting to a pure-Python 'solution' if the system has no grep.
Reinventing the wheel is usually only practical if the existing ones
aren't round :-)
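A sketch of what that wrapping might look like; the function name is
invented, the -o/-F/-f flags are GNU grep, and note the fallback matches
whole tokens while grep -o also matches substrings:

import subprocess

def words_in_file(wordlist_path, file_path):
    # Ask grep to print each fixed-string match (-o) using the
    # patterns read from the word-list file (-f).
    try:
        proc = subprocess.Popen(
            ["grep", "-o", "-F", "-f", wordlist_path, file_path],
            stdout=subprocess.PIPE, universal_newlines=True)
        out, _ = proc.communicate()
        return set(out.split())
    except OSError:
        # No grep on this system: fall back to a pure-Python scan.
        words = set(open(wordlist_path).read().split())
        return set(w for w in open(file_path).read().split() if w in words)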

--
mph
Jun 27 '08 #7
On Jun 18, 10:29 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
brad wrote:
<cut>

Use an indexer, like Lucene (available as PyLucene), or a database that
offers word indices.

Diez
I've been toying around with Nucular (http://nucular.sourceforge.net/)
a bit recently for some side projects. It's pure Python and seems to
work fairly well for my needs. I haven't pumped all that much data
into it, though.
Jun 27 '08 #8
On Jun 18, 11:01 pm, Kris Kennaway <k...@FreeBSD.org> wrote:
Calvin Spealman wrote:
<cut>

If you can't use an indexer, and performance matters, evaluate using
grep and a shell script. Seriously.
<cut>

Kris
Alternatively, if you don't feel like writing shell scripts, you can
write a Python program which auto-generates the desired shell script
that utilizes grep. E.g., use Python to generate the file list that is
passed to grep as arguments. ;-P
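For example, a small sketch where Python assembles grep's argument list
directly (paths invented; -l makes grep print only the names of files
that contain a match):

import glob
import subprocess

# Let Python build the file list instead of a shell script.
files = glob.glob("files/*.txt")
subprocess.call(["grep", "-l", "-F", "-f", "wordlist.txt"] + files)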
Jun 27 '08 #9
brad wrote:
<cut>
Full text indexing.

Jun 27 '08 #10
