Hello, list.
I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.
Let's say the list has a sentence below
1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "
2. "Python has been an important part of Google"
3. "important part of Google"
As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)
So I decided to clean up the list.
I tried to do this simple brute-force manner, like
---------------------------------------------------------------------------
sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - (unneeded))
.....
---------------------------------------------------------------------------
This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.
I'm not familiar with algorithms/data structure and large-scale data processing,
so any advice, suggestions and recommendations will be appreciated.
Thank you in advance. 8 1624
js wrote:
Hello, list.
I have a list of sentence in text files that I use to filter-out some data.
I managed the list so badly that now it's become literally a mess.
Let's say the list has a sentence below
1. "Python has been an important part of Google since the beginning,
and remains so as the system grows and evolves. "
2. "Python has been an important part of Google"
3. "important part of Google"
As you see sentence 2 is a subset of sentence 1
so I don't need to have sentence 1 on the list.
Are you 100% sure that you wnat to throw away the *longer* sentence?
(For some reason, it's no problem to have sentence 3.
Only sentence that has the "same prefix part" is the one I want to remove)
So I decided to clean up the list.
I tried to do this simple brute-force manner, like
Please don't waste time and bandwith with "like"; post the exact code
that you executed.
>
---------------------------------------------------------------------------
sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]
# won't compile, missing a colon at end
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - (unneeded))
# can't do set - list
# Presuming you mean - set(unneeded), but then produces an empty list
(see later).
# Naming the output "sorted_lis t" is misleading advertising.
# With sorted_list[:] in an inner loop, it's no wonder that it runs
slowly.
# Test your code on small samples of data and ensure that it is working
correctly before you run it on huge amounts of data.
# Be careful of end cases; note that in my solutions below, the first
or last item needs special handling.
....
---------------------------------------------------------------------------
This is so slow and not so helpful because the list is
so big(more than 100M bytes and has about 3 million lines)
and I have more than 100 lists.
Here's one way of doing it, tested to the extent shown:
C:\junk>type prefixdel.py
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]
"""
sorted_list = sorted(file('th elist'), key=len)
for line in sorted_list[:]
unneeded = [ line2 for line2 in sorted_list[:] if line2.startswit h(line) ]
sorted_list = list(set(sorted _list) - set(unneeded))
"""
slist = sorted(data, key=len)
print "OP trial 1 input"; pp(slist)
for line in slist[:]:
unneeded = [line2 for line2 in slist[:] if line2.startswit h(line)]
slist = list(set(slist) - set(unneeded))
print "OP trial 1 output"; pp(slist)
ilist = sorted(data)
print "sorted input"; pp(ilist)
olist = []
for i in xrange(len(ilis t)-1, 0, -1):
if ilist[i].startswith(ili st[i-1]):
continue
olist.append(il ist[i])
olist.append(il ist[0])
print "output (keeping shorter)"; pp(olist)
olist = []
for i in xrange(len(ilis t)-1):
if ilist[i+1].startswith(ili st[i]):
continue
olist.append(il ist[i])
olist.append(il ist[-1])
print "output (keeping longer)"; pp(olist)
C:\junk>prefixd el.py
OP trial 1 input
['foo',
'food',
'food',
'zzzz',
'xyzzy',
'plugh',
'foo bar',
'foo bar baz',
'xyzzy and plugh are magic']
OP trial 1 output
[]
sorted input
['foo',
'foo bar',
'foo bar baz',
'food',
'food',
'plugh',
'xyzzy',
'xyzzy and plugh are magic',
'zzzz']
output (keeping shorter)
['zzzz', 'xyzzy', 'plugh', 'food', 'foo']
output (keeping longer)
['foo bar baz', 'food', 'plugh', 'xyzzy and plugh are magic', 'zzzz']
HTH,
John
Thank you for the quick reply.
Here're the exact code I executed. (including your code)
#!/usr/bin/env python
from pprint import pprint as pp
data = [
'foo bar baz',
'foo bar',
'foo',
'food',
'food', # duplicate
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz',
]
data_sorted_by_ len = sorted(data, key=len)
data_sorted_by_ asc = sorted(data)
print "OP trial 1 input"; pp(data)
def prefixdel_stupi dly(alist):
for line in alist[:]:
unneeded = [no for no, line2 in enumerate(alist[1:]) if
line2.startswit h(line) and line != line2]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return alist
def prefixdel_recur sively(alist):
if len(alist) < 2:
return alist
unneeded = [no for no, line in enumerate(alist[1:]) if
line.startswith (alist[0])]
adjust=1
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return [alist[0]] + prefixdel_recur sively(alist[1:])
def prefixdel_by_jo hn(alist):
olist = []
for i in xrange(len(alis t)-1, 0, -1):
if alist[i].startswith(ali st[i-1]):
continue
olist.append(al ist[i])
olist.append(al ist[0])
return olist
if __name__ == '__main__':
from timeit import Timer
print sorted(prefixde l_stupidly(data _sorted_by_len[:]))
print sorted(prefixde l_recursively(d ata_sorted_by_l en[:]))
print sorted(prefixde l_by_john(data_ sorted_by_asc[:]))
t = Timer("prefixde l_stupidly(data _sorted_by_len) ", "from __main__
import prefixdel_stupi dly, data_sorted_by_ len")
print t.timeit(number =100000)
t = Timer("prefixde l_recursively(d ata_sorted_by_l en)", "from
__main__ import prefixdel_recur sively, data_sorted_by_ len")
print t.timeit(number =100000)
t = Timer("prefixde l_by_john(data_ sorted_by_asc)" , "from __main__
import prefixdel_by_jo hn, data_sorted_by_ asc")
print t.timeit(number =100000)
The output is the following:
$ python2.5 myprefixdel.py
OP trial 1 input
['foo bar baz',
'foo bar',
'foo',
'food',
'food',
'xyzzy',
'plugh',
'xyzzy and plugh are magic',
'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'plugh', 'xyzzy', 'zzzz']
['foo', 'food', 'plugh', 'xyzzy', 'zzzz']
1.17837095261
1.21182584763
0.737352132797
Your code is much faster than ones I wrote
but the result is a little bit different.('foo d' shoud be removed
because 'foo' is in the list)
As you pointed out, list[:] must be a big evil but without duplicating the list,
the problem become much harder to solve to me.
Any practical solution, anyone?
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.
def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist
first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if line.startswith (first)]
adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.
def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist
first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if
line.startswith (first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634
Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.
I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?
Peter
Try:
def leaders(sorted_ list):
held = None
for phrase in held:
if held is None or not phrase.startswi th(held):
held = row
yield held
print list(leaders(so rted(data)))
--Scott David Daniels sc***********@a cm.org
def leaders(sorted_ list):
held = None
for phrase in held:
Ummm...that
for phrase in None:
doesn't do a whole lot other than give a traceback. :*)
I suspect you mean
for phrase in sorted_list:
no?
-tkc
Tim Chase wrote:
> def leaders(sorted_ list): held = None for phrase in held:
... I suspect you mean
for phrase in sorted_list:
no?
Ooops -- renaming code before posting should get its own test.
You are absolutely correct.
--Scott David Daniels sc***********@a cm.org
I did the test the way you suggested. It took not so long to realize
stupid mistakes I made. Thank you.
The following is the result of test with timeit(number=1 0000)
using fresh copy of the list for every iteration
0.331462860107
0.19401717186
0.186257839203
0.0762069225311
I tried my recursive-function to fix up my big-messed-list.
It stops immediately because of 'RuntimeError: maximum recursion limit exceeded'
I hope this trial-and-errors getting me good place...
anyway, thank you.
On 10/13/06, Peter Otten <__*******@web. dewrote:
js wrote:
By eliminating list cloning, my function got much faster than before.
I really appreciate you, John.
def prefixdel_recur sively2(alist):
if len(alist) < 2:
return alist
first = alist.pop(0)
unneeded = [no for no, line in enumerate(alist ) if
line.startswith (first)] adjust=0
for i in unneeded:
del alist[i+adjust]
adjust -= 1
return [first] + prefixdel_recur sively(alist)
process stime
prefixdel_stupi dly : 11.9247150421
prefixdel_recur sively : 14.6975700855
prefixdel_recur sively2 : 0.408113956451
prefixdel_by_jo hn : 7.60227012634
Those are suspicious results. Time it again with number=1, or a fresh copy
of the data for every iteration.
I also have my doubts whether sorting by length is a good idea. To take it
to the extreme: what if your data file contains an empty line?
Peter
-- http://mail.python.org/mailman/listinfo/python-list This thread has been closed and replies have been disabled. Please start a new discussion. Similar topics |
by: John R |
last post by:
Hi all,
I'm trying to get my VB6 app to look up prefixes for a phone number out of
an MDB file along with an associated price etc.
For example the phone number could be 9802xxxx, and the MDB file will have
the record 9802 with an associated price.
I am used to connecting to databases via ADO, but my problem is that the MDB
file may contain prefixes from 1 to 5 chars in length and i'm not sure how
|
by: Anand Pillai |
last post by:
To search a word in a group of words, say a paragraph or a web page,
would a string search or a regexp search be faster?
The string search would of course be,
if str.find(substr) != -1:
domything()
And the regexp search assuming no case restriction would be,
|
by: Holger Joukl |
last post by:
Hi there,
2 questions regarding build/installation issues:
1. In the python 2.3.3 setup.py script, the detect_modules method of
class PyBuildExt contains the following code:
253 if os.path.normpath(sys.prefix) != '/usr':
254 add_dir_to_list(self.compiler.library_dirs,
255 sysconfig.get_config_var("LIBDIR"))
256 add_dir_to_list(self.compiler.include_dirs,
|
by: Dan |
last post by:
I suspect this isn't specifically a Python question, but I
encountered it with Python so I thought I'd ask here.
I'm running Linux (Fedora 2), and just downloaded the Python 2.4
kit. I did the following from my user account:
./configure --prefix=/some/private/dir --enable-shared
make
make test # all was okay
make install
|
by: spike |
last post by:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...
Btw, thanks to all of you who answered last time!
code:
-------------------------------------------------------------------------
#include <stdio.h>
| |
by: Julie |
last post by:
What is the *fastest* way in .NET to search large on-disk text files (100+ MB)
for a given string.
The files are unindexed and unsorted, and for the purposes of my immediate
requirements, can't be indexed/sorted.
I don't want to load the entire file into physical memory, memory-mapped files
are ok (and preferred). Speed/performance is a requirement -- the target is to
locate the string in 10 seconds or less for a 100 MB file. The...
|
by: Jim |
last post by:
Hello,
I am working on a small windows application for a client, and as one of the
functions they want a search that will let them enter a search string, then
search a directory for all flies that contain that search string AND display
the lines that contain the search string.
They have windows ME, XP and 2000 systems.
Does anyone have any ideas as to the most efficient way to do this?
|
by: BizTalk Benjamin |
last post by:
Hi, I have an XmlDocument loaded from a memory stream. I set the document
element prefix in this way
XmlElement e = xDoc.DocumentElement;
e.Prefix = "abc"
When i simply write the document element to the command line it shows the
prefix. However, it does not persist in the XmlDoc so when i save the XmlDoc
into a file, the prefix is no longer there.
|
by: Bertram Trabant |
last post by:
Hello, Im working on a little LAN game in the style of old text-only
MUD's and would need to find a way to search for a string in a text
file (for example for usernames). I know it works in the way of looking
for the first letter, if matches the second and so on, but don't know
how to write it. Any suggestions?
|
by: marktang |
last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However, people are often confused as to whether an ONU can Work As a Router. In this blog post, we’ll explore What is ONU, What Is Router, ONU & Router’s main usage, and What is the difference between ONU and Router. Let’s take a closer look !
Part I. Meaning of...
|
by: Oralloy |
last post by:
Hello folks,
I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>".
The problem is that using the GNU compilers, it seems that the internal comparison operator "<=>" tries to promote arguments from unsigned to signed.
This is as boiled down as I can make it.
Here is my compilation command:
g++-12 -std=c++20 -Wnarrowing bit_field.cpp
Here is the code in...
| |
by: jinu1996 |
last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven tapestry of website design and digital marketing. It's not merely about having a website; it's about crafting an immersive digital experience that captivates audiences and drives business growth.
The Art of Business Website Design
Your website is...
|
by: agi2029 |
last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing, and deployment—without human intervention. Imagine an AI that can take a project description, break it down, write the code, debug it, and then launch it, all on its own....
Now, this would greatly impact the work of software developers. The idea...
|
by: isladogs |
last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM).
In this session, we are pleased to welcome a new presenter, Adolph Dupré who will be discussing some powerful techniques for using class modules.
He will explain when you may want to use classes instead of User Defined Types (UDT). For example, to manage the data in unbound forms.
Adolph will...
|
by: conductexam |
last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and then checking html paragraph one by one.
At the time of converting from word file to html my equations which are in the word document file was convert into image.
Globals.ThisAddIn.Application.ActiveDocument.Select();...
|
by: TSSRALBI |
last post by:
Hello
I'm a network technician in training and I need your help.
I am currently learning how to create and manage the different types of VPNs and I have a question about LAN-to-LAN VPNs.
The last exercise I practiced was to create a LAN-to-LAN VPN between two Pfsense firewalls, by using IPSEC protocols.
I succeeded, with both firewalls in the same network. But I'm wondering if it's possible to do the same thing, with 2 Pfsense firewalls...
|
by: muto222 |
last post by:
How can i add a mobile payment intergratation into php mysql website.
| |
by: bsmnconsultancy |
last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence can significantly impact your brand's success. BSMN Consultancy, a leader in Website Development in Toronto offers valuable insights into creating effective websites that not only look great but also perform exceptionally well. In this comprehensive...
| |