
Orders of magnitude

x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]

# list version:
c = []
for i in x:
    if i not in c:
        c.append(i)

t1.timeit(1)
2.0331145438875637

# dict version:
c = {}
for i in x:
    if i not in c:
        c[i] = None

t2.timeit(1)
0.0067952770534134288

# bsddb version:
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None

t3.timeit(1)
0.18430750276922936
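
(The t1, t2 and t3 timers themselves aren't shown above; presumably they are timeit.Timer objects wrapping each loop, with the list comprehension in the setup. A minimal sketch of that setup for the dict case, as a reconstruction rather than the original code:)

import timeit

setup = """
x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]
"""

dict_stmt = """
c = {}
for i in x:
    if i not in c:
        c[i] = None
"""

t2 = timeit.Timer(dict_stmt, setup)   # t1 and t3 are built the same way
print t2.timeit(1)                    # (t3's setup would also need 'import bsddb')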

Wow. Dicts are *fast*.

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?
Robert Brewer
MIS
Amor Ministries
fu******@amor.org

Jul 18 '05 #1
"Robert Brewer" <fu******@amor.org> writes:
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?


Sort, then remove dups.
Jul 18 '05 #2
Sort, then remove dups.


A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.

I would suggest a multi-phase method. Break the sequence up into
100k-element blocks, remove duplicates within each block using a
dictionary, sort each block individually, and write each to a file.

Merge the 100 files using a heapq.
# heap is structured like...
[("next string in file a", file_handle_a),
 ("next string in file b", file_handle_b), ...]

If you keep using heappop and heappush from heapq, and remember just a
single string (the last key written out), you can remove duplicates
quite easily, and you certainly won't run out of memory.
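
A minimal sketch of that merge step, assuming the sorted block files have one key per line (the function name and file handling here are illustrative, not code from this thread):

import heapq

def merge_dedup(block_names, out_name):
    # Open every sorted block file and prime the heap with its first line;
    # the file index breaks ties so equal strings never need to compare
    # anything else.
    files = [open(name) for name in block_names]
    heap = []
    for idx, f in enumerate(files):
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, idx))

    out = open(out_name, 'w')
    last = None                        # the single remembered string
    while heap:
        line, idx = heapq.heappop(heap)
        if line != last:               # write only the first of each run of dups
            out.write(line)
            last = line
        nxt = files[idx].readline()    # refill the heap from the same file
        if nxt:
            heapq.heappush(heap, (nxt, idx))
    out.close()
    for f in files:
        f.close()

# e.g. merge_dedup(['block00.txt', 'block01.txt'], 'deduped.txt')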

- Josiah
Jul 18 '05 #3
PF


It all boils down to how much space your keys take.
When you look for dupes, you should hold only the keys in memory, not the
data (it'll be a lot faster this way).

I'd say create a bsddb with a btree index to hold all your keys. It should
take about 20 minutes to fill it. Then scan it in sorted key order, and
duplicates will appear next to each other.
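
Something along these lines, assuming one record per line with a '|'-delimited key in the first field (the file names and field layout are assumptions, not details from this thread):

import bsddb

db = bsddb.btopen('keys.db', 'c')          # disk-backed B-tree of keys
for line in open('records.txt'):           # 'records.txt' is a placeholder
    key = line.split('|')[0]               # hypothetical: key is the first field
    db[key] = ''                           # a duplicate key simply overwrites
db.sync()

# A btree hands its keys back in sorted order, so this walk sees each
# distinct key exactly once, already sorted.
for key in db.keys():
    print key
db.close()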
Jul 18 '05 #4
Josiah Carlson <jc******@nospam.uci.edu> writes:
Sort, then remove dups.


A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.


I meant sort externally. I thought these records were in a disk file
or something.
Jul 18 '05 #5
Robert Brewer wrote:
# bsddb version:
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None


You might try a minor change:

for i in x:
    c[i] = None

The fewer the redundancies, the larger the speed gain (I suppose; that
really needs testing to be sure).

Peter

Jul 18 '05 #6
Hello!

Dictionaries are faster for lookups ("... in ...") because the lookup
goes through the key. With a list, the search is sequential.

In short, for "... in ...":
- a dictionary uses key-based lookup
- a list uses a sequential search

@-salutations
--
Michel Claveau
Jul 18 '05 #7
"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?


In case you are interested in alternative approaches...here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set
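
Step 2 can be just a few lines. A minimal sketch, assuming one record per line and the whole line as the dedup key (the file names are placeholders):

prev = None
out = open('deduped.txt', 'w')          # output file name is a placeholder
for row in open('sorted.txt'):          # the file produced by step 1
    if row != prev:                     # keep only the first row of each run
        out.write(row)
    prev = row
out.close()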

ks
Jul 18 '05 #8
Buck Nuggets wrote:
"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?

In case you are interested in alternatives approaches...here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
If you are looking for duplicates and do it by imposing a complete
ordering, then you are throwing lots of information away, since you are
not actually after the neighbor relationships, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!

ciao - chris
--
Christian Tismer :^) <mailto:ti****@stackless.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
Jul 18 '05 #9
Buck Nuggets wrote:
step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


It might be worth mentioning that if your idea of 'sameness' is
identical textual representation, the '-u' flag of 'sort' does step 2
for you. Won't work in the more general case, but then 'sort' might not
work either. YMMV.

Stefan,
--
Stefan Axelsson (email at http://www.cs.chalmers.se/~sax)
Jul 18 '05 #10
Christian Tismer <ti****@stackless.com> wrote in message news:<ma*************************************@python.org>...
Buck Nuggets wrote:
"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...

In case you are interested in alternatives approaches...here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
If you are seeking for duplicates, and doing it by
complete ordering, then you are thwowing lots of information
away, since you are not seeking for neighborship, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!


Ha, that's ok. I've been doing exactly this kind of thing for over
twenty years (crusty old database developer). I think that you will
find that it is more efficient in both development and run time. And
it's simple enough that once you start down this path you won't need
to brainstorm on how to get it to work.

Rather than taking 2-18 hours with the previously mentioned solutions
(which require index-building and 10 million index lookups), you'll
probably do the entire thing in about 10 minutes (9 minutes to sort the
file + 1 minute to check dups).
Jul 18 '05 #11
On 30 Mar 2004 06:57:16 -0800, bu*********@yahoo.com (Buck Nuggets)
wrote:
Christian Tismer <ti****@stackless.com> wrote in message news:<ma*************************************@python.org>...
Buck Nuggets wrote:
> "Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...
>
> In case you are interested in alternatives approaches...here's how I
> typically do this:
>
> step 1: sort the file using a separate sort utility (unix sort, cygwin
> sort, etc)
>
> step 2: have a python program read in rows,
> compare each row to the prior,
> write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
If you are seeking for duplicates, and doing it by
complete ordering, then you are thwowing lots of information
away, since you are not seeking for neighborship, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!


Ha, that's ok. I've been doing exactly this kind of thing for over
twenty years (crusty old database developer). I think that you will
find that it is more efficient in both development and run time. And
it's simple enough that once you start down this path you won't need
to brainstorm on how to get it to work.

Rather than taking 2-18 hours with the previously mentioned solutions
(which require index-building and 10 million index lookups), you'll
probably do the entire thing in about 10 minutes (9 minutes to sort
file + 1 minute to check dups).

From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).

And it's no effort, if it fits the bill. It may be inefficient with
regard to sorting algorithms, but extremely efficient in terms of
system and developer resources.

--dang
Jul 18 '05 #12
Dang Griffith <no*****@noemail4u.com> wrote in message news:<8e******************************@news.teranews.com>...
From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).

And it's no effort, if it fits the bill. It may be inneficient with
regards to sorting algorithms, but extremely efficient in terms of
system and developer resources.


ha, yep - forgot to mention that you can drop dupes right in some sort
utilities or via a program like uniq.

I suppose the only reason to use Python after the sort would be if you
have some unusual rules to include in the delta calculation (don't
include this record if status-field='del', etc).
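
For instance, a sketch of that kind of filtered dedup pass (the field positions and the 'del' status value are hypothetical, as are the file names):

import sys

prev_key = None
for line in open('sorted.txt'):                 # output of the sort step
    fields = line.rstrip('\n').split('|')
    if fields[2] == 'del':                      # hypothetical status field
        continue                                # drop records flagged 'del'
    key = fields[0]                             # hypothetical dedup key field
    if key != prev_key:
        sys.stdout.write(line)                  # first occurrence only
        prev_key = key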

ks
Jul 18 '05 #13
Dang Griffith <no*****@noemail4u.com> writes:
From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).


From a maybe even crustier unix developer: "sort -u" strips dups.
Jul 18 '05 #14
