
Orders of magnitude

x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]

# list version
c = []
for i in x:
    if i not in c:
        c.append(i)

t1.timeit(1)
2.0331145438875637

# dict version
c = {}
for i in x:
    if i not in c:
        c[i] = None

t2.timeit(1)
0.0067952770534134288

# bsddb version
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None

t3.timeit(1)
0.18430750276922936
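
(The Timer objects t1/t2/t3 weren't shown above; a minimal sketch of how they could have been built with timeit, assuming each statement matches the corresponding snippet:)

import timeit

setup = "x = [chr(a) + chr(b) for a in xrange(100) for b in xrange(100)]"

t1 = timeit.Timer(
    "c = []\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c.append(i)",
    setup)

t2 = timeit.Timer(
    "c = {}\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c[i] = None",
    setup)

t3 = timeit.Timer(
    "c = bsddb.btopen(None)\n"
    "for i in x:\n"
    "    if i not in c:\n"
    "        c[i] = None",
    "import bsddb\n" + setup)

print t1.timeit(1), t2.timeit(1), t3.timeit(1)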

Wow. Dicts are *fast*.

I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?
Robert Brewer
MIS
Amor Ministries
fu******@amor.org

Jul 18 '05 #1


"Robert Brewer" <fu******@amor.org> writes:
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?


Sort, then remove dups.
Jul 18 '05 #2

Sort, then remove dups.


A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.

I would suggest a multi-phase method. Break the sequence up into blocks
of 100k elements each and remove duplicates within each block using
dictionaries. Sort each block individually and write each to its own file.

Merge the 100 files by using a heapq.
#heap is structured like...
[("next string in file a", file_handle_a),
("next string in file b", file_handle_b),...]

If you keep using heappop and heappush from heapq, and remember only the
last string written out, you can remove duplicates quite easily and
certainly won't run out of memory.
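
A rough sketch of that merge phase (assuming the sorted, per-block files
from phase 1 already exist; the file names are placeholders):

import heapq

def merge_dedup(filenames, outname):
    # Open all the sorted block files and prime the heap with their first lines.
    files = [open(name) for name in filenames]
    heap = []
    for f in files:
        line = f.readline()
        if line:
            heapq.heappush(heap, (line, f))

    last = None                 # the single string of state we need to keep
    out = open(outname, 'w')
    while heap:
        line, f = heapq.heappop(heap)
        if line != last:        # duplicates come out adjacent, so skip repeats
            out.write(line)
            last = line
        nxt = f.readline()
        if nxt:
            heapq.heappush(heap, (nxt, f))
    out.close()
    for f in files:
        f.close()

# e.g. merge_dedup(['block_%d.txt' % i for i in range(100)], 'deduped.txt')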

- Josiah
Jul 18 '05 #3

PF


It all boils down to how much space your keys take.
When you look for dupes, you must hold only the keys in memory, not the
data (it'll be a lot faster this way).

I'd say create a bsddb btree to hold all your keys. It should take
about 20 minutes to fill it. Then scan it in sorted key order, and
duplicates will appear next to each other.
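
A minimal sketch of that idea (the file names, the '|'-delimited record
format, and the key extraction are placeholders, not from the original post):

import bsddb

keys = bsddb.btopen('keys.db', 'c')      # on-disk btree, so not limited by RAM

for record in open('records.txt'):       # placeholder input file
    key = record.split('|')[0]           # placeholder: however you derive the key
    keys[key] = ''                       # re-inserting a key simply overwrites it

# A btree iterates in sorted key order, and identical keys collapse on
# insert, so this scan yields each key exactly once.
out = open('unique_keys.txt', 'w')
for key in keys.keys():
    out.write(key + '\n')
out.close()
keys.close()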
Jul 18 '05 #4

Josiah Carlson <jc******@nospam.uci.edu> writes:
Sort, then remove dups.


A list of 10 million integers sucks up ~160 megs of memory in Python. I
doubt the strings would fit even then.


I meant sort externally. I thought these records were in a disk file
or something.
Jul 18 '05 #5

Robert Brewer wrote:
# bsddb version
c = bsddb.btopen(None)
for i in x:
    if i not in c:
        c[i] = None


You might try a minor change:

for i in x:
    c[i] = None

The fewer the redundancies the larger the speed gain (I suppose; that really
needs testing to be sure).

Peter

Jul 18 '05 #6

Hello!

Dictionaries are faster for lookups ("... in ...") because the key (hash)
is used directly. With a list, the search is sequential.

In short, for "... in ...":
- a dictionary uses a key (hash) lookup
- a list uses a sequential scan


Regards
--
Michel Claveau
Jul 18 '05 #7

"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?


In case you are interested in alternative approaches, here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set
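
Step 2 is only a few lines; a rough sketch (the file names are placeholders):

prev = None
out = open('deduped.txt', 'w')           # placeholder output file
for row in open('sorted_records.txt'):   # output of the external sort
    if row != prev:                      # rows are sorted, so dups are adjacent
        out.write(row)
        prev = row
out.close()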

ks
Jul 18 '05 #8

Buck Nuggets wrote:
"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@pyth on.org>...
I'm dedup'ing a 10-million-record dataset, trying different approaches
for building indexes. The in-memory dicts are clearly faster, but I get
Memory Errors (Win2k, 512 MB RAM, 4 G virtual). Any recommendations on
other ways to build a large index without slowing down by a factor of
25?

In case you are interested in alternatives approaches...here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
if you are looking for duplicates and you do it by
complete ordering, then you are throwing lots of information
away, since you don't actually need the full neighbour ordering, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!

ciao - chris
--
Christian Tismer :^) <mailto:ti****@stackless.com>
Mission Impossible 5oftware : Have a break! Take a ride on Python's
Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
work +49 30 89 09 53 34 home +49 30 802 86 56 mobile +49 173 24 18 776
PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
whom do you want to sponsor today? http://www.stackless.com/
Jul 18 '05 #9

Buck Nuggets wrote:
step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


It might be worth mentioning that if your idea of 'sameness' is
identical textual representation, the '-u' flag of 'sort' does step 2
for you. Won't work in the more general case, but then 'sort' might not
work either. YMMV.

Stefan,
--
Stefan Axelsson (email at http://www.cs.chalmers.se/~sax)
Jul 18 '05 #10

Christian Tismer <ti****@stackless.com> wrote in message news:<ma*************************************@python.org>...
Buck Nuggets wrote:
"Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@python.org>...

In case you are interested in alternatives approaches...here's how I
typically do this:

step 1: sort the file using a separate sort utility (unix sort, cygwin
sort, etc)

step 2: have a python program read in rows,
compare each row to the prior,
write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
if you are looking for duplicates and you do it by
complete ordering, then you are throwing lots of information
away, since you don't actually need the full neighbour ordering, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!


Ha, that's ok. I've been doing exactly this kind of thing for over
twenty years (crusty old database developer). I think that you will
find that it is more efficient in both development and run time. And
it's simple enough that once you start down this path you won't need
to brainstorm on how to get it to work.

Rather than taking 2-18 hours with the previously mentioned solutions
(which require index-building and 10 million index lookups), you'll
probably do the entire thing in about 10 minutes (9 minutes to sort
file + 1 minute to check dups).
Jul 18 '05 #11

On 30 Mar 2004 06:57:16 -0800, bu*********@yahoo.com (Buck Nuggets)
wrote:
Christian Tismer <ti****@stackless.com> wrote in message news:<ma*************************************@python.org>...
Buck Nuggets wrote:
> "Robert Brewer" <fu******@amor.org> wrote in message news:<ma*************************************@python.org>...
>
> In case you are interested in alternatives approaches...here's how I
> typically do this:
>
> step 1: sort the file using a separate sort utility (unix sort, cygwin
> sort, etc)
>
> step 2: have a python program read in rows,
> compare each row to the prior,
> write out only one row for each set


Good solution, but wayyyy too much effort.
You probably know it:
if you are looking for duplicates and you do it by
complete ordering, then you are throwing lots of information
away, since you don't actually need the full neighbour ordering, right?
That clearly means: it must be inefficient.

No offense, just trying to get you on the right track!


Ha, that's ok. I've been doing exactly this kind of thing for over
twenty years (crusty old database developer). I think that you will
find that it is more efficient in both development and run time. And
it's simple enough that once you start down this path you won't need
to brainstorm on how to get it to work.

Rather than taking 2-18 hours with the previously mentioned solutions
(which require index-building and 10 million index lookups), you'll
probably do the entire thing in about 10 minutes (9 minutes to sort
file + 1 minute to check dups).

From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).

And it's no effort, if it fits the bill. It may be inefficient with
regard to sorting algorithms, but it's extremely efficient in terms of
system and developer resources.

--dang
Jul 18 '05 #12

Dang Griffith <no*****@noemail4u.com> wrote in message news:<8e******************************@news.teranews.com>...
From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).

And it's no effort, if it fits the bill. It may be inefficient with
regard to sorting algorithms, but it's extremely efficient in terms of
system and developer resources.


ha, yep - forgot to mention that you can drop dupes right in some sort
utilities or via a program like uniq.

I suppose the only reason to use python after the sort would be if you
have some unusual rules to include in the delta calculation (don't
include this record if status-field='del', etc).

ks
Jul 18 '05 #13

Dang Griffith <no*****@noemail4u.com> writes:
From a crusty old unix developer to a crusty old database developer...
Part 2 can be done by piping the output of sort to the 'uniq' program
(available in cygwin and mingw also, I think).


From a maybe even crustier unix developer: "sort -u" strips dups.
Jul 18 '05 #14
