Bytes | Software Development & Data Engineering Community

prefix search on a large file

js
Hello, list.

I have a list of sentences in text files that I use to filter out some data.
I managed the list so badly that it has now become a real mess.

Let's say the list has the sentences below:

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you can see, sentence 2 is a prefix of sentence 1,
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3;
only a sentence that shares the same prefix as a shorter one is what I want to remove.)

So I decided to clean up the list.

I tried to do this in a simple brute-force manner, like

---------------------------------------------------------------------------
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
    unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
    sorted_list = list(set(sorted_list) - (unneeded))
    .....
---------------------------------------------------------------------------

This is very slow and not practical because each list is
so big (more than 100 MB, with about 3 million lines)
and I have more than 100 lists.
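Rough arithmetic shows why the brute-force scan cannot finish at this scale; a sketch:

```python
# Comparing every line against every other line is O(n**2):
n = 3_000_000              # about 3 million lines per list
print("%.1e" % (n * n))    # ~9.0e+12 startswith() calls per list
```

And that is per list; with 100 lists the total is 100 times larger still.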

I'm not familiar with algorithms/data structures or large-scale data processing,
so any advice, suggestions, and recommendations will be appreciated.

Thank you in advance.
Oct 12 '06 #1
8 replies, 1609 views

js wrote:
Hello, list.

I have a list of sentences in text files that I use to filter out some data.
I managed the list so badly that it has now become a real mess.

Let's say the list has the sentences below:

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you can see, sentence 2 is a prefix of sentence 1,
so I don't need to have sentence 1 on the list.
Are you 100% sure that you want to throw away the *longer* sentence?
(For some reason, it's no problem to have sentence 3;
only a sentence that shares the same prefix as a shorter one is what I want to remove.)

So I decided to clean up the list.

I tried to do this in a simple brute-force manner, like
Please don't waste time and bandwidth with "like"; post the exact code
that you executed.
---------------------------------------------------------------------------
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
    # won't compile, missing a colon at end
    unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
    sorted_list = list(set(sorted_list) - (unneeded))
    # can't do set - list
    # Presuming you mean - set(unneeded), but then produces an empty list (see later).
    # Naming the output "sorted_list" is misleading advertising.
    # With sorted_list[:] in an inner loop, it's no wonder that it runs slowly.
    # Test your code on small samples of data and ensure that it is working
    # correctly before you run it on huge amounts of data.
    # Be careful of end cases; note that in my solutions below, the first
    # or last item needs special handling.
    .....
---------------------------------------------------------------------------

This is very slow and not practical because each list is
so big (more than 100 MB, with about 3 million lines)
and I have more than 100 lists.
Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
    'foo bar baz',
    'foo bar',
    'foo',
    'food',
    'food', # duplicate
    'xyzzy',
    'plugh',
    'xyzzy and plugh are magic',
    'zzzz',
    ]

"""
sorted_list = sorted(file('thelist'), key=len)
for line in sorted_list[:]
    unneeded = [ line2 for line2 in sorted_list[:] if line2.startswith(line) ]
    sorted_list = list(set(sorted_list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
    unneeded = [line2 for line2 in slist[:] if line2.startswith(line)]
    slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilist)-1, 0, -1):
    if ilist[i].startswith(ilist[i-1]):
        continue
    olist.append(ilist[i])
olist.append(ilist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilist)-1):
    if ilist[i+1].startswith(ilist[i]):
        continue
    olist.append(ilist[i])
olist.append(ilist[-1])
print "output (keeping longer)"; pp(olist)
C:\junk>prefixdel.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John

Oct 12 '06 #2
js
Thank you for the quick reply.

Here's the exact code I executed (including your code):

#!/usr/bin/env python
from pprint import pprint as pp

data = [
    'foo bar baz',
    'foo bar',
    'foo',
    'food',
    'food', # duplicate
    'xyzzy',
    'plugh',
    'xyzzy and plugh are magic',
    'zzzz',
    ]

data_sorted_by_len = sorted(data, key=len)
data_sorted_by_asc = sorted(data)
print "OP trial 1 input"; pp(data)

def prefixdel_stupidly(alist):
    for line in alist[:]:
        unneeded = [no for no, line2 in enumerate(alist[1:]) if
                    line2.startswith(line) and line != line2]
        adjust = 1
        for i in unneeded:
            del alist[i+adjust]
            adjust -= 1
    return alist

def prefixdel_recursively(alist):
    if len(alist) < 2:
        return alist

    unneeded = [no for no, line in enumerate(alist[1:]) if
                line.startswith(alist[0])]
    adjust = 1
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [alist[0]] + prefixdel_recursively(alist[1:])

def prefixdel_by_john(alist):
    olist = []
    for i in xrange(len(alist)-1, 0, -1):
        if alist[i].startswith(alist[i-1]):
            continue
        olist.append(alist[i])
    olist.append(alist[0])
    return olist

if __name__ == '__main__':
    from timeit import Timer
    print sorted(prefixdel_stupidly(data_sorted_by_len[:]))
    print sorted(prefixdel_recursively(data_sorted_by_len[:]))
    print sorted(prefixdel_by_john(data_sorted_by_asc[:]))

    t = Timer("prefixdel_stupidly(data_sorted_by_len)",
              "from __main__ import prefixdel_stupidly, data_sorted_by_len")
    print t.timeit(number=100000)
    t = Timer("prefixdel_recursively(data_sorted_by_len)",
              "from __main__ import prefixdel_recursively, data_sorted_by_len")
    print t.timeit(number=100000)
    t = Timer("prefixdel_by_john(data_sorted_by_asc)",
              "from __main__ import prefixdel_by_john, data_sorted_by_asc")
    print t.timeit(number=100000)

The output is the following:

$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
'foo bar',
'foo',
'food',
'food',
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'food', 'plugh', 'xyzzy', 'zzzz']
1.17837095261
1.21182584763
0.737352132797

Your code is much faster than the ones I wrote,
but the result is a little bit different. ('food' should be removed
because 'foo' is in the list.)

As you pointed out, list[:] must be a big evil, but without duplicating the list
the problem becomes much harder for me to solve.

Any practical solution, anyone?
Oct 12 '06 #3
js
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
    if len(alist) < 2:
        return alist

    first = alist.pop(0)
    unneeded = [no for no, line in enumerate(alist) if line.startswith(first)]
    adjust = 0
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634
Oct 12 '06 #4
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
    if len(alist) < 2:
        return alist

    first = alist.pop(0)
    unneeded = [no for no, line in enumerate(alist) if
                line.startswith(first)]
    adjust = 0
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634
Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
Oct 12 '06 #5
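Peter's caveat about empty lines is easy to check: the empty string is a prefix of every string, so a single blank line would dominate any shortest-prefix filter. A quick demonstration (Python 3 syntax):

```python
# Every string starts with the empty string:
print("anything".startswith(""))   # True

# So a blank line swallows everything in a shortest-prefix pass:
lines = ["", "foo bar", "foo", "plugh"]
kept = []
for line in sorted(lines):
    if not kept or not line.startswith(kept[-1]):
        kept.append(line)
print(kept)   # [''] -- only the blank line survives
```

Stripping or skipping empty lines before sorting avoids the surprise.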

Try:

def leaders(sorted_list):
    held = None
    for phrase in held:
        if held is None or not phrase.startswith(held):
            held = row
            yield held

print list(leaders(sorted(data)))

--Scott David Daniels
sc***********@acm.org
Oct 12 '06 #6
def leaders(sorted_list):
    held = None
    for phrase in held:
Ummm...that

for phrase in None:

doesn't do a whole lot other than give a traceback. :*)

I suspect you mean

for phrase in sorted_list:

no?

-tkc
Oct 12 '06 #7
Tim Chase wrote:
> def leaders(sorted_list):
>     held = None
>     for phrase in held:
... I suspect you mean
for phrase in sorted_list:
no?
Ooops -- renaming code before posting should get its own test.
You are absolutely correct.
--Scott David Daniels
sc***********@acm.org
Oct 12 '06 #8
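Putting Tim's fix together with the original intent (reading `row` as `phrase`), the working generator would look like this; a sketch in Python 3 syntax:

```python
def leaders(sorted_list):
    # Yield each line of a lexicographically sorted list unless the
    # previously yielded (shorter) line is a prefix of it.
    held = None
    for phrase in sorted_list:
        if held is None or not phrase.startswith(held):
            held = phrase
            yield held

data = ['foo bar baz', 'foo bar', 'foo', 'food', 'plugh']
print(list(leaders(sorted(data))))   # ['foo', 'plugh']
```

Sorting lexicographically first is what makes the single pass sufficient: every extension of a line sorts immediately after it.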
js
I did the test the way you suggested. It didn't take long to realize
the stupid mistakes I made. Thank you.

The following is the result of the test with timeit(number=10000),
using a fresh copy of the list for every iteration:

0.331462860107
0.19401717186
0.186257839203
0.0762069225311

I tried my recursive function to fix up my big messed-up list.
It stops immediately with 'RuntimeError: maximum recursion depth exceeded'.

I hope this trial and error is getting me to a good place...

anyway, thank you.
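For what it's worth, the recursion-limit crash disappears if the same pop-and-filter idea is written as a loop; a sketch (Python 3 syntax, function name illustrative):

```python
def prefixdel_iteratively(alist):
    # Same idea as prefixdel_recursively2, but with a while loop instead
    # of recursion, so it is safe for millions of lines.
    alist = sorted(alist, key=len)   # shortest lines first
    result = []
    while alist:
        first = alist.pop(0)
        result.append(first)
        # drop every remaining line that merely extends `first`
        alist = [line for line in alist if not line.startswith(first)]
    return result

print(sorted(prefixdel_iteratively(['foo bar', 'foo', 'food', 'zzzz'])))
# ['foo', 'zzzz']
```

Each surviving line filters the rest in one list comprehension, so there is no call stack to overflow.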

On 10/13/06, Peter Otten <__*******@web.de> wrote:
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recursively2(alist):
    if len(alist) < 2:
        return alist

    first = alist.pop(0)
    unneeded = [no for no, line in enumerate(alist) if
                line.startswith(first)]
    adjust = 0
    for i in unneeded:
        del alist[i+adjust]
        adjust -= 1

    return [first] + prefixdel_recursively(alist)
process stime
prefixdel_stupidly : 11.9247150421
prefixdel_recursively : 14.6975700855
prefixdel_recursively2 : 0.408113956451
prefixdel_by_john : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
Oct 13 '06 #9
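Wrapping up the thread's conclusion for the original problem (about 100 files of roughly 100 MB each): sort each file lexicographically, run one linear pass, and write the result back. The sketch below builds on Scott's corrected generator; the `lists/*.txt` path is purely illustrative:

```python
import glob

def shortest_prefixes(sorted_lines):
    # From lexicographically sorted input, yield only lines that do not
    # merely extend the previously yielded line.
    held = None
    for line in sorted_lines:
        if held is None or not line.startswith(held):
            held = line
            yield held

# Hypothetical batch cleanup over many list files:
for path in glob.glob('lists/*.txt'):
    with open(path) as f:
        cleaned = list(shortest_prefixes(sorted(line.rstrip('\n') for line in f)))
    with open(path + '.cleaned', 'w') as out:
        out.write('\n'.join(cleaned) + '\n')
```

A 100 MB file sorts comfortably in memory on a modern machine; for truly huge inputs an external sort (e.g. the Unix `sort` command) could feed the same generator.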
