prefix search on a large file

Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like

---------------------------------------------------------------------------
sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - (unneeded))
.....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.

I'm not familiar with algorithms/data structure and large-scale data processing,
so any advice, suggestions and recommendations will be appreciated.

Thank you in advance.

Oct 12 '06 #1

Subscribe Reply

1624

John Machin

js wrote:

Hello, list.

I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.

Let's say the list has a sentence below

1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "

2. "Python has been an important part of Google"

3. "important part of Google"

As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.

Are you 100% sure that you wnat to throw away the *longer* sentence?

(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)

So I decided to clean up the list.

I tried to do this simple brute-force manner, like

Please don't waste time and bandwith with "like"; post the exact code
that you executed.

>
---------------------------------------------------------------------------
sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]

# won't compile, missing a colon at end

unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - (unneeded))

# can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_lis t" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.

....
---------------------------------------------------------------------------

This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.

Here's one way of doing it, tested to the extent shown:

C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

"""

sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - set(unneeded))

"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
unneeded = [line2 for line2 in slist[:] if line2.startswit h(line)]
slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)

ilist = sorted(data)
print "sorted input"; pp(ilist)

olist = []
for i in xrange(len(ilis t)-1, 0, -1):
if ilist[i].startswith(ili st[i-1]):
continue
olist.append(il ist[i])
olist.append(il ist[0])
print "output (keeping shorter)"; pp(olist)

olist = []
for i in xrange(len(ilis t)-1):
if ilist[i+1].startswith(ili st[i]):
continue
olist.append(il ist[i])
olist.append(il ist[-1])
print "output (keeping longer)"; pp(olist)
C:\junk>prefixd el.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']

HTH,
John

Oct 12 '06 #2

Thank you for the quick reply.

Here're the exact code I executed. (including your code)

#!/usr/bin/env python
from pprint import pprint as pp

data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]

data_sorted_by_ len = sorted(data, key=len)
data_sorted_by_ asc = sorted(data)
print "OP trial 1 input"; pp(data)

def prefixdel_stupi dly(alist):
for line in alist[:]:
unneeded = [no for no, line2 in enumerate(alist[1:]) if
line2.startswit h(line) and line != line2]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return alist

def prefixdel_recur sively(alist):
if len(alist) < 2:
return alist

unneeded = [no for no, line in enumerate(alist[1:]) if
line.startswith (alist[0])]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [alist[0]] + prefixdel_recur sively(alist[1:])

def prefixdel_by_jo hn(alist):
olist = []
for i in xrange(len(alis t)-1, 0, -1):
if alist[i].startswith(ali st[i-1]):
continue
olist.append(al ist[i])
olist.append(al ist[0])
return olist

if __name__ == '__main__':
from timeit import Timer
print sorted(prefixde l_stupidly(data _sorted_by_len[:]))
print sorted(prefixde l_recursively(d ata_sorted_by_l en[:]))
print sorted(prefixde l_by_john(data_ sorted_by_asc[:]))

t = Timer("prefixde l_stupidly(data _sorted_by_len) ", "from __main__
import prefixdel_stupi dly, data_sorted_by_ len")
print t.timeit(number =100000)
t = Timer("prefixde l_recursively(d ata_sorted_by_l en)", "from
__main__ import prefixdel_recur sively, data_sorted_by_ len")
print t.timeit(number =100000)
t = Timer("prefixde l_by_john(data_ sorted_by_asc)" , "from __main__
import prefixdel_by_jo hn, data_sorted_by_ asc")
print t.timeit(number =100000)

The output is the following:

$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
'foo bar',
'foo',
'food',
'food',
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'food', 'plugh', 'xyzzy', 'zzzz']
1.17837095261
1.21182584763
0.737352132797

Your code is much faster than ones I wrote
but the result is a little bit different.('foo d' shoud be removed
because 'foo' is in the list)

As you pointed out, list[:] must be a big evil but without duplicating the list,
the problem become much harder to solve to me.

Any practical solution, anyone?

Oct 12 '06 #3

By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if line.startswith (first)]
adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634

Oct 12 '06 #4

Peter Otten

js wrote:

By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if
line.startswith (first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter

Oct 12 '06 #5

Scott David Daniels

Try:

def leaders(sorted_ list):
held = None
for phrase in held:
if held is None or not phrase.startswi th(held):
held = row
yield held

print list(leaders(so rted(data)))

--Scott David Daniels
sc***********@a cm.org

Oct 12 '06 #6

Tim Chase

def leaders(sorted_ list):

held = None
for phrase in held:

Ummm...that

for phrase in None:

doesn't do a whole lot other than give a traceback. :*)

I suspect you mean

for phrase in sorted_list:

no?

-tkc

Oct 12 '06 #7

Scott David Daniels

Tim Chase wrote:

> def leaders(sorted_ list):
held = None
for phrase in held:
... I suspect you mean
for phrase in sorted_list:
no?

Ooops -- renaming code before posting should get its own test.
You are absolutely correct.
--Scott David Daniels
sc***********@a cm.org

Oct 12 '06 #8

I did the test the way you suggested. It took not so long to realize
stupid mistakes I made. Thank you.

The following is the result of test with timeit(number=1 0000)
using fresh copy of the list for every iteration

0.331462860107
0.19401717186
0.186257839203
0.0762069225311

I tried my recursive-function to fix up my big-messed-list.
It stops immediately because of 'RuntimeError: maximum recursion limit exceeded'

I hope this trial-and-errors getting me good place...

anyway, thank you.

On 10/13/06, Peter Otten <__*******@web. dewrote:

js wrote:

By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.

def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist

first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if
line.startswith (first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1

return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634

Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.

I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?

Peter
--
http://mail.python.org/mailman/listinfo/python-list

Oct 13 '06 #9

Similar topics

8395

Number Prefix Lookup Table in MDB

by: John R | last post by:

Hi all, I'm trying to get my VB6 app to look up prefixes for a phone number out of an MDB file along with an associated price etc. For example the phone number could be 9802xxxx, and the MDB file will have the record 9802 with an associated price. I am used to connecting to databases via ADO, but my problem is that the MDB file may contain prefixes from 1 to 5 chars in length and i'm not sure how

Visual Basic 4 / 5 / 6

39352

String search vs regexp search

by: Anand Pillai | last post by:

To search a word in a group of words, say a paragraph or a web page, would a string search or a regexp search be faster? The string search would of course be, if str.find(substr) != -1: domything() And the regexp search assuming no case restriction would be,

Python

2870

python 2.3.3 setup.py: why adding sys.prefix to include_dir, libdir?

by: Holger Joukl | last post by:

Hi there, 2 questions regarding build/installation issues: 1. In the python 2.3.3 setup.py script, the detect_modules method of class PyBuildExt contains the following code: 253 if os.path.normpath(sys.prefix) != '/usr': 254 add_dir_to_list(self.compiler.library_dirs, 255 sysconfig.get_config_var("LIBDIR")) 256 add_dir_to_list(self.compiler.include_dirs,

Python

2542

Redirecting ./configure --prefix

by: Dan | last post by:

I suspect this isn't specifically a Python question, but I encountered it with Python so I thought I'd ask here. I'm running Linux (Fedora 2), and just downloaded the Python 2.4 kit. I did the following from my user account: ./configure --prefix=/some/private/dir --enable-shared make make test # all was okay make install

Python

2610

Search a binary file for a string... again! (its to slow)

by: spike | last post by:

Im writing a program to search for a string in a binary file. And it works. The problem is: It is sooo slow! how can i make it faster? It takes 27 seconds just to search a 5 meg file. I guess it has something to do with the strequal() function... Btw, thanks to all of you who answered last time! code: ------------------------------------------------------------------------- #include <stdio.h>

C / C++

49181

Fastest way to search text file for string

by: Julie | last post by:

What is the *fastest* way in .NET to search large on-disk text files (100+ MB) for a given string. The files are unindexed and unsorted, and for the purposes of my immediate requirements, can't be indexed/sorted. I don't want to load the entire file into physical memory, memory-mapped files are ok (and preferred). Speed/performance is a requirement -- the target is to locate the string in 10 seconds or less for a 100 MB file. The...

C# / C Sharp

2114

How to search files for text string most efficiently?

by: Jim | last post by:

Hello, I am working on a small windows application for a client, and as one of the functions they want a search that will let them enter a search string, then search a directory for all flies that contain that search string AND display the lines that contain the search string. They have windows ME, XP and 2000 systems. Does anyone have any ideas as to the most efficient way to do this?

Visual Basic .NET

7292

Force the prefix of the document element

by: BizTalk Benjamin | last post by:

Hi, I have an XmlDocument loaded from a memory stream. I set the document element prefix in this way XmlElement e = xDoc.DocumentElement; e.Prefix = "abc" When i simply write the document element to the command line it shows the prefix. However, it does not persist in the XmlDoc so when i save the XmlDoc into a file, the prefix is no longer there.

.NET Framework

2462

string search?

by: Bertram Trabant | last post by:

Hello, Im working on a little LAN game in the style of old text-only MUD's and would need to find a way to search for a string in a text file (for example for usernames). I know it works in the way of looking for the first letter, if matches the second and so on, but don't know how to write it. Any suggestions?

C / C++

8946

What is ONU?

by: marktang | last post by:

ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look ! Part I. Meaning of...

General

9449

Problem With Comparison Operator <=> in G++

by: Oralloy | last post by:

Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed. This is as boiled down as I can make it. Here is my compilation command: g++-12 -std=c++20 -Wnarrowing bit_field.cpp Here is the code in...

C / C++

9310

Maximizing Business Potential: The Nexus of Website Design and Digital Marketing

by: jinu1996 | last post by:

In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth. The Art of Business Website Design Your website is...

Online Marketing

8186

AI Job Threat for Devs

by: agi2029 | last post by:

Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own.... Now, this would greatly impact the work of software developers. The idea...

Career Advice

6735

Access Europe - Using VBA to create a class based on a table - Wed 1 May

by: isladogs | last post by:

The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules. He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms. Adolph will...

Microsoft Access / VBA

6031

Couldn’t get equations in html when convert word .docx file to html file in C#.

by: conductexam | last post by:

I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one. At the time of converting from word file to html my equations which are in the word document file was convert into image. Globals.ThisAddIn.Application.ActiveDocument.Select();...

C# / C Sharp

4550

Trying to create a lan-to-lan vpn between two differents networks

by: TSSRALBI | last post by:

Hello I'm a network technician in training and I need your help. I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs. The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols. I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...

Networking - Hardware / Configuration

2724

How to add payments to a PHP MySQL app.

by: muto222 | last post by:

How can i add a mobile payment intergratation into php mysql website.

PHP

2180

Comprehensive Guide to Website Development in Toronto: Expert Insights from BSMN Consultancy

by: bsmnconsultancy | last post by:

In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...

General