Just wondering if anyone has ever solved this efficiently... not looking
for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading a
file once I find 10 of the words in it. It's easy for me to do this with
a few dozen words, but a thousand words is too large for an RE and too
inefficient to loop, etc. Any suggestions?
Thanks
brad wrote:
Just wondering if anyone has ever solved this efficiently... not looking
for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading a
file once I find 10 of the words in it. It's easy for me to do this with
a few dozen words, but a thousand words is too large for an RE and too
inefficient to loop, etc. Any suggestions?
Use an indexer, like lucene (available as pylucene) or a database that
offers word-indices.
Diez
Upload, wait, and google them.
Seriously tho, aside from using a real indexer, I would build a set
of the words I'm looking for, and then loop over each file, looping
over the words and doing quick checks for containment in the set. If
so, add to a dict of file names to lists of words found until the list
hits length 10. I don't think that would be a complicated solution
and it shouldn't be terrible at performance.
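Something along these lines, say (an untested sketch; wordlist.txt and
the file names are placeholders):

def scan(filename, words, limit=10):
    # Collect up to `limit` distinct query words appearing in the file.
    found = []
    with open(filename) as f:
        for line in f:
            for token in line.split():
                if token in words and token not in found:
                    found.append(token)
                    if len(found) >= limit:
                        return found
    return found

words = set(line.strip() for line in open("wordlist.txt"))
results = dict((name, scan(name, words)) for name in ["file1.txt", "file2.txt"])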
If you need to run this more than once, use an indexer.
If you only need to use it once, use an indexer, so you learn how for
next time.
On Jun 18, 2008, at 10:28 AM, brad wrote:
Just wondering if anyone has ever solved this efficiently... not
looking for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read
the files to see if some of the words are in the files. I can stop
reading a file once I find 10 of the words in it. It's easy for me
to do this with a few dozen words, but a thousand words is too
large for an RE and too inefficient to loop, etc. Any suggestions?
Thanks
Calvin Spealman wrote:
Upload, wait, and google them.
Seriously tho, aside from using a real indexer, I would build a set of
the words I'm looking for, and then loop over each file, looping over
the words and doing quick checks for containment in the set. If so, add
to a dict of file names to lists of words found until the list hits
length 10. I don't think that would be a complicated solution and it
shouldn't be terrible at performance.
If you need to run this more than once, use an indexer.
If you only need to use it once, use an indexer, so you learn how for
next time.
If you can't use an indexer, and performance matters, evaluate using
grep and a shell script. Seriously.
grep is a couple of orders of magnitude faster at pattern matching
strings in files (and especially regexps) than python is. Even if you
are invoking grep multiple times it is still likely to be faster than a
"maximally efficient" single pass over the file in python. This
realization was disappointing to me :)
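If you want to drive grep from python anyway, here is a rough sketch
(untested; wordlist.txt and the file names are placeholders):

import subprocess

def distinct_matches(path, wordlist="wordlist.txt"):
    # -F: fixed strings, -f: patterns from a file, -o: one match per line;
    # add -w if you only want whole-word matches, not substrings.
    out = subprocess.run(["grep", "-o", "-F", "-f", wordlist, path],
                         capture_output=True, text=True).stdout
    return set(out.splitlines())

for name in ["file1.txt", "file2.txt"]:
    if len(distinct_matches(name)) >= 10:
        print(name)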
Kris
brad wrote:
Just wondering if anyone has ever solved this efficiently... not
looking for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading
a file once I find 10 of the words in it. It's easy for me to do this
with a few dozen words, but a thousand words is too large for an RE
and too inefficient to loop, etc. Any suggestions?
The quick answer would be:
grep -F -f WORDLIST FILE1 FILE2 ... FILE1000
where WORDLIST is a file containing the thousand words, one per line.
The more interesting answers would be to use either a suffix tree or an
Aho-Corasick graph.
- The suffix tree is a representation of the target string (your files)
that allows you to search quickly for a word. Your problem would then be
solved by 1) building a suffix tree for your files, and 2) searching for
each word sequentially in the suffix tree.
- The Aho-Corasick graph is a representation of the query word list that
allows fast scanning for the words over a target string. Your problem would
then be solved by 1) building an Aho-Corasick graph for the list of
words, and 2) scanning each file sequentially.
The preference for using one or the other depends on some details
of your problem: the expected size of the target files, the rate of
overlap between words in your list (are there common prefixes?), whether
you will repeat the operation with another word list or another set of files,
etc. Personally, I'd lean towards Aho-Corasick; it is a matter of taste,
and the kind of applications that comes to my mind makes it more practical.
Btw, the `grep -F -f` combo builds an Aho-Corasick graph. Also you can
find modules for building both data structures in the python package index.
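For instance, pyahocorasick is one such module (my example, not something
brad mentioned); a minimal sketch with placeholder words and file:

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for word in ["foo", "bar", "baz"]:  # your thousand words
    automaton.add_word(word, word)
automaton.make_automaton()  # builds the failure links

found = set()
for end_index, word in automaton.iter(open("somefile.txt").read()):
    found.add(word)  # note: this matches substrings, not just whole words
    if len(found) >= 10:  # the "stop after 10 words" condition
        break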
Cheers,
RB
I forgot to mention another way: put one thousand monkeys to work on it. ;)
RB
Kris Kennaway wrote:
<cut>
If you can't use an indexer, and performance matters, evaluate using
grep and a shell script. Seriously.
grep is a couple of orders of magnitude faster at pattern matching
strings in files (and especially regexps) than python is. Even if you
are invoking grep multiple times it is still likely to be faster than a
"maximally efficient" single pass over the file in python. This
realization was disappointing to me :)
Kris
Adding to this:
Then again, there is nothing wrong with wrapping grep from python and
reverting to a pure python 'solution' if the system has no grep.
Reinventing the wheel is usually only practical if the existing ones
aren't round :-)
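Something of that shape, roughly (untested; -l just lists the files
that match):

import shutil
import subprocess

def files_with_matches(wordfile, filenames):
    if shutil.which("grep"):
        # -l lists the matching files; -F/-f as in the grep command above
        res = subprocess.run(["grep", "-l", "-F", "-f", wordfile] + filenames,
                             capture_output=True, text=True)
        return res.stdout.splitlines()
    # pure python fallback: naive containment scan
    words = set(w.strip() for w in open(wordfile))
    return [name for name in filenames
            if any(tok in words
                   for line in open(name) for tok in line.split())]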
--
mph
On Jun 18, 10:29 am, "Diez B. Roggisch" <de...@nospam.web.de> wrote:
brad wrote:
Just wondering if anyone has ever solved this efficiently... not looking
for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading a
file once I find 10 of the words in it. It's easy for me to do this with
a few dozen words, but a thousand words is too large for an RE and too
inefficient to loop, etc. Any suggestions?
Use an indexer, like lucene (available as pylucene) or a database that
offers word-indices.
Diez
I've been toying around with Nucular (http://nucular.sourceforge.net/)
a bit recently for some side projects. It's pure Python and seems to
work fairly well for my needs. I haven't pumped all that much data
into it, though.
On Jun 18, 11:01 pm, Kris Kennaway <k...@FreeBSD.org> wrote:
Calvin Spealman wrote:
Upload, wait, and google them.
Seriously tho, aside from using a real indexer, I would build a set of
the words I'm looking for, and then loop over each file, looping over
the words and doing quick checks for containment in the set. If so, add
to a dict of file names to lists of words found until the list hits
length 10. I don't think that would be a complicated solution and it
shouldn't be terrible at performance.
If you need to run this more than once, use an indexer.
If you only need to use it once, use an indexer, so you learn how for
next time.
If you can't use an indexer, and performance matters, evaluate using
grep and a shell script. Seriously.
grep is a couple of orders of magnitude faster at pattern matching
strings in files (and especially regexps) than python is. Even if you
are invoking grep multiple times it is still likely to be faster than a
"maximally efficient" single pass over the file in python. This
realization was disappointing to me :)
Kris
Alternatively, if you don't feel like writing shell scripts, you can
write a Python program which auto-generates the desired shell script
which utilizes grep. E.g. use Python for generating the file list
which is passed to grep as arguments. ;-P
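E.g. a tiny sketch along those lines (the glob is a placeholder for
however you build the file list):

import glob
import shlex

files = glob.glob("*.txt")  # build the file list however you like
cmd = "grep -F -f wordlist.txt " + " ".join(shlex.quote(f) for f in files)
open("search.sh", "w").write("#!/bin/sh\n" + cmd + "\n")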
brad wrote:
Just wondering if anyone has ever solved this efficiently... not looking
for specific solutions tho... just ideas.
I have one thousand words and one thousand files. I need to read the
files to see if some of the words are in the files. I can stop reading a
file once I find 10 of the words in it. It's easy for me to do this with
a few dozen words, but a thousand words is too large for an RE and too
inefficient to loop, etc. Any suggestions?
Full text indexing.