472,337 Members | 1,471 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 472,337 software developers and data experts.

prefix search on a large file

js
Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like

---------------------------------------------------------------------------
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - (unneeded))
.....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.

I'm not familiar with algorithms/data structure and large-scale data processing,
so any advice, suggestions and recommendations will be appreciated.

Thank you in advance.
Oct 12 '06 #1
8 1552

js wrote:
Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
Are you 100% sure that you wnat to throw away the *longer* sentence?
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like
Please don't waste time and bandwith with "like"; post the exact code
that you executed.
>
---------------------------------------------------------------------------
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
# won't compile, missing a colon at end
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - (unneeded))
# can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_list" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.
....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.
Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

"""
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
if ilist[i].startswith(ilist[i-1]):
continue
olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
if ilist[i+1].startswith(ilist[i]):
continue
olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)
C:\junk>prefixdel.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John

Oct 12 '06 #2
js
Thank you for the quick reply.

Here're the exact code I executed. (including your code)

#!/usr/bin/env python
from pprint import pprint as pp

data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

data_sorted_by_len = sorted(data, key=len)
data_sorted_by_asc = sorted(data)
print "OP trial 1 input"; pp(data)

def prefixdel_stupidly(alist):
for line in alist[:]:
unneeded = [no for no, line2 in enumerate(alist[1:]) if
line2.startswith(line) and line != line2]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return alist

def prefixdel_recursively(alist):
if len(alist) < 2:
return alist

unneeded = [no for no, line in enumerate(alist[1:]) if
line.startswith(alist[0])]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [alist[0]] + prefixdel_recursively(alist[1:])

def prefixdel_by_john(alist):
olist = []
for i in xrange(len(alist)-1, 0, -1):
if alist[i].startswith(alist[i-1]):
continue
olist.append(alist[i])
olist.append(alist[0])
return olist

if __name__ == '__main__':
from timeit import Timer
print sorted(prefixdel_stupidly(data_sorted_by_len[:]))
print sorted(prefixdel_recursively(data_sorted_by_len[:]))
print sorted(prefixdel_by_john(data_sorted_by_asc[:]))

t = Timer("prefixdel_stupidly(data_sorted_by_len)", "from __main__
import prefixdel_stupidly, data_sorted_by_len")
print t.timeit(number=100000)
t = Timer("prefixdel_recursively(data_sorted_by_len)", "from
__main__ import prefixdel_recursively, data_sorted_by_len")
print t.timeit(number=100000)
t = Timer("prefixdel_by_john(data_sorted_by_asc)", "from __main__
import prefixdel_by_john, data_sorted_by_asc")
print t.timeit(number=100000)

The output is the following:

$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
'foo bar',
'foo',
'food',
'food',
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'food', 'plugh', 'xyzzy', 'zzzz']
1.17837095261
1.21182584763
0.737352132797

Your code is much faster than ones I wrote
but the result is a little bit different.('food' shoud be removed
because 'foo' is in the list)

As you pointed out, list[:] must be a big evil but without duplicating the list,
the problem become much harder to solve to me.

Any practical solution, anyone?
Oct 12 '06 #3
js
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if line.startswith(first)]
adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634
Oct 12 '06 #4
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if
line.startswith(first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634
Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
Oct 12 '06 #5

Try:

def leaders(sorted_list):
held = None
for phrase in held:
if held is None or not phrase.startswith(held):
held = row
yield held

print list(leaders(sorted(data)))

--Scott David Daniels
sc***********@acm.org
Oct 12 '06 #6
def leaders(sorted_list):
held = None
for phrase in held:
Ummm...that

for phrase in None:

doesn't do a whole lot other than give a traceback. :*)

I suspect you mean

for phrase in sorted_list:

no?

-tkc
Oct 12 '06 #7
Tim Chase wrote:
> def leaders(sorted_list):
held = None
for phrase in held:
... I suspect you mean
for phrase in sorted_list:
no?
Ooops -- renaming code before posting should get its own test.
You are absolutely correct.
--Scott David Daniels
sc***********@acm.org
Oct 12 '06 #8
js
I did the test the way you suggested. It took not so long to realize
stupid mistakes I made. Thank you.

The following is the result of test with timeit(number=10000)
using fresh copy of the list for every iteration

0.331462860107
0.19401717186
0.186257839203
0.0762069225311

I tried my recursive-function to fix up my big-messed-list.
It stops immediately because of 'RuntimeError: maximum recursion limit exceeded'

I hope this trial-and-errors getting me good place...

anyway, thank you.

On 10/13/06, Peter Otten <__*******@web.dewrote:
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist) if
line.startswith(first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
--
http://mail.python.org/mailman/listinfo/python-list
Oct 13 '06 #9

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: John R | last post by:
Hi all, I'm trying to get my VB6 app to look up prefixes for a phone number out of an MDB file along with an associated price etc. For example...
10
by: Anand Pillai | last post by:
To search a word in a group of words, say a paragraph or a web page, would a string search or a regexp search be faster? The string search would...
1
by: Holger Joukl | last post by:
Hi there, 2 questions regarding build/installation issues: 1. In the python 2.3.3 setup.py script, the detect_modules method of class PyBuildExt...
3
by: Dan | last post by:
I suspect this isn't specifically a Python question, but I encountered it with Python so I thought I'd ask here. I'm running Linux (Fedora 2),...
7
by: spike | last post by:
Im writing a program to search for a string in a binary file. And it works. The problem is: It is sooo slow! how can i make it faster? It takes 27...
60
by: Julie | last post by:
What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string. The files are unindexed and unsorted, and for...
5
by: Jim | last post by:
Hello, I am working on a small windows application for a client, and as one of the functions they want a search that will let them enter a search...
4
by: BizTalk Benjamin | last post by:
Hi, I have an XmlDocument loaded from a memory stream. I set the document element prefix in this way XmlElement e = xDoc.DocumentElement;...
33
by: Bertram Trabant | last post by:
Hello, Im working on a little LAN game in the style of old text-only MUD's and would need to find a way to search for a string in a text file (for...
0
by: concettolabs | last post by:
In today's business world, businesses are increasingly turning to PowerApps to develop custom business applications. PowerApps is a powerful tool...
0
by: teenabhardwaj | last post by:
How would one discover a valid source for learning news, comfort, and help for engineering designs? Covering through piles of books takes a lot of...
0
by: Kemmylinns12 | last post by:
Blockchain technology has emerged as a transformative force in the business world, offering unprecedented opportunities for innovation and...
0
by: CD Tom | last post by:
This happens in runtime 2013 and 2016. When a report is run and then closed a toolbar shows up and the only way to get it to go away is to right...
0
by: CD Tom | last post by:
This only shows up in access runtime. When a user select a report from my report menu when they close the report they get a menu I've called Add-ins...
0
by: antdb | last post by:
Ⅰ. Advantage of AntDB: hyper-convergence + streaming processing engine In the overall architecture, a new "hyper-convergence" concept was...
2
by: Matthew3360 | last post by:
Hi, I have a python app that i want to be able to get variables from a php page on my webserver. My python app is on my computer. How would I make it...
0
by: AndyPSV | last post by:
HOW CAN I CREATE AN AI with an .executable file that would suck all files in the folder and on my computerHOW CAN I CREATE AN AI with an .executable...
0
by: Arjunsri | last post by:
I have a Redshift database that I need to use as an import data source. I have configured the DSN connection using the server, port, database, and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.