Bytes | Developer Community

Large Amount of Data

I need to process a large amount of data. The data structure fits well
in a dictionary, but the amount is large: close to, or more than, the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory, or will it fail?

Thanks.
May 25 '07 #1
On May 25, 10:50 am, "Jack" <nos...@invalid.com> wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?
The OS will take care of memory swapping. It might get slow, but I
don't think it should fail.

Matt

May 25 '07 #2
In <H8******************************@comcast.com>, Jack wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?

What about putting the data into a database? If the keys are strings, the
`shelve` module might be a solution.
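[A sketch of that suggestion, for reference: `shelve` behaves like a dict whose values are pickled to disk, so records are read on demand instead of held in RAM. The filename and keys below are invented for illustration.]

```python
import os
import shelve
import tempfile

# Hypothetical on-disk index; shelve.open gives a persistent mapping.
path = os.path.join(tempfile.mkdtemp(), "index.db")

with shelve.open(path) as db:
    db["doc-00001"] = {"size": 1024, "merged": False}  # value is pickled

# Reopening reads records from disk on demand, not all at once.
with shelve.open(path) as db:
    rec = db["doc-00001"]
    rec["merged"] = True
    db["doc-00001"] = rec  # write back: mutating rec alone isn't persisted

with shelve.open(path) as db:
    result = db["doc-00001"]
```

Note the explicit write-back: by default `shelve` does not notice in-place mutation of a stored value.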

Ciao,
Marc 'BlackJack' Rintsch
May 25 '07 #3
Thanks for the replies!

Database will be too slow for what I want to do.


May 25 '07 #4
Jack wrote:
> Thanks for the replies!
>
> Database will be too slow for what I want to do.
Purchase more memory. It is REALLY cheap these days.

-Larry
May 25 '07 #5
On 5/25/07, Jack <no****@invalid.com> wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?
Could you process it in chunks, instead of reading in all the data at once?
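[A minimal sketch of that idea, with invented records and a made-up handler: build a dict no larger than a fixed bound, process it, discard it, and continue, so memory use is proportional to the chunk size rather than to the whole input.]

```python
def process_in_chunks(records, handle, chunk_size):
    """Accumulate at most chunk_size items in a dict, pass the dict to
    handle(), then discard it and keep reading."""
    chunk, out = {}, []
    for key, value in records:
        chunk[key] = value
        if len(chunk) >= chunk_size:
            out.append(handle(chunk))
            chunk = {}
    if chunk:  # flush the final partial chunk
        out.append(handle(chunk))
    return out

# Hypothetical usage: sum values per chunk instead of loading everything.
totals = process_in_chunks(
    [("a", 1), ("b", 2), ("c", 3)],
    handle=lambda d: sum(d.values()),
    chunk_size=2,
)
```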
May 26 '07 #6
Larry Bates wrote:
> Purchase more memory. It is REALLY cheap these days.
Not a solution at all. What if the amount of data exceeds the
architecture's memory limits, i.e. 4 GB on 32-bit?

A better solution is to use a database for data storage/processing.

--
Vyacheslav Maslov
May 26 '07 #7
Jack wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?

What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and you need to talk
seriously about 64-bit address-space machines. At the other, you have no
idea how to either use a database or do sequential processing. Tell us
more.

John Nagle
May 26 '07 #8
I have tens of millions (could be more) of documents in files. Each of
them has other properties in separate files. I need to check if they
exist, update and merge properties, etc. And this is not a one-time job.
Because of the quantity of the files, I think querying and updating a
database will take a long time...

Let's say I want to do something a search engine needs to do, in terms
of the amount of data to be processed on a server. I doubt any serious
search engine would use a database for indexing and searching. A hash
table is what I need, not powerful queries.

May 26 '07 #9
I suppose I can, but it won't be very efficient. I can have a smaller
hash table, process the items that are in it, and save the ones that
are not for another round of processing. But a chunked hash table won't
work that well, because you don't know whether a key exists in other
chunks. To do this, I'll need a rule to partition the data into chunks.
So this is more work in general.
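[One common partition rule, sketched here with invented names: route each key through a stable hash, so records that must be compared or merged always land in the same chunk and no cross-chunk lookups are needed.]

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key):
    # Use a stable digest (not Python's randomized hash()) so the same
    # key maps to the same partition across runs and across machines.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

Process one partition's hash table at a time; anything that refers to the same key is guaranteed to be in the same round.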

May 26 '07 #10
If swap memory cannot handle this efficiently, I may need to partition
the data across multiple servers and use RPC to communicate.

"Dennis Lee Bieber" <wl*****@ix.netcom.com> wrote in message
news:YY******************@newsread1.news.pas.earthlink.net...
> On Fri, 25 May 2007 11:11:28 -0700, "Jack" <no****@invalid.com>
> declaimed the following in comp.lang.python:
> > Database will be too slow for what I want to do.
>
> Slower than having every process on the computer potentially slowed
> down due to page swapping (and, for really huge data, still running
> the risk of exceeding the single-process address space)?

May 26 '07 #11
In <da******************************@comcast.com>, Jack wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...

But databases are exactly what is built and optimized to handle large
amounts of data.

> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

You are not forced to use complex queries, and an index is much like a
hash table, often even implemented as one. And a database doesn't have
to be an SQL database. The `shelve` module, or an object DB like ZODB or
Durus, are databases too.

Maybe you should try it and measure before claiming it's going to be too
slow and spending the time to implement something like a database
yourself.
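[A rough sketch of such a measurement: time the same lookups against an in-memory dict and a `shelve` file. The sizes and filename are invented and far smaller than the real workload; scale `n` up toward the real data before drawing conclusions.]

```python
import os
import shelve
import tempfile
import time

n = 1000  # hypothetical; the real workload is tens of millions
keys = ["doc-%06d" % i for i in range(n)]
data = {k: i for i, k in enumerate(keys)}

# Populate an on-disk shelve with the same records.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
with shelve.open(path) as db:
    for k, v in data.items():
        db[k] = v

# Time identical lookup loops over both stores.
t0 = time.perf_counter()
dict_total = sum(data[k] for k in keys)
dict_secs = time.perf_counter() - t0

with shelve.open(path) as db:
    t0 = time.perf_counter()
    shelve_total = sum(db[k] for k in keys)
    shelve_secs = time.perf_counter() - t0
```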

Ciao,
Marc 'BlackJack' Rintsch
May 26 '07 #12
Jack wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...
And I think you are wrong. But of course the only way to find out who's
right and who's wrong is to do some experiments and get some benchmark
timings.

All I *would* say is that it's unwise to proceed with a memory-only
architecture when you only have assumptions about the limitations of
particular architectures, and your problem might actually grow to exceed
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to
take a real nose-dive. Then where do you go? Much better to architect
the application so that you anticipate exceeding memory limits from the
start, I'd hazard.
> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
You might be surprised. Google, for example, use a widely-distributed
and highly-redundant storage format, but they certainly don't keep the
whole Internet in memory :-)

Perhaps you need to explain the problem in more detail if you still need
help.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden

May 26 '07 #13
On May 26, 6:17 pm, "Jack" <nos...@invalid.com> wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?

> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...
Don't think, benchmark.
> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
Having a single hash table permits two not-very-powerful query methods:
(1) return the data associated with a single hash key; (2) trawl through
the whole hash table, applying various conditions to the data. If that
is all you want, then comparisons with a serious search engine are quite
irrelevant.

What is relevant is that the whole hash table has to be in virtual
memory before you can start either type of query. This is not the case
with a database. Type 1 queries (with a suitable index on the primary
key) should use only a fraction of the memory that a full hash table
would.

What is the primary key of your data?
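[A small sketch of such a type-1 lookup with the stdlib's `sqlite3`; the table, columns, and values are invented. The `PRIMARY KEY` gives an index, so the query is answered without scanning or loading the whole table.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for on-disk data
conn.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, props TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("doc-1", "size=10"), ("doc-2", "size=20")],
)

# Type 1 query: fetch one record by primary key via the index.
row = conn.execute(
    "SELECT props FROM docs WHERE doc_id = ?", ("doc-2",)
).fetchone()
conn.close()
```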

May 27 '07 #14
I'll save them in a file for further processing.


May 27 '07 #15
On May 27, 11:24 am, "Jack" <nos...@invalid.com> wrote:
> I'll save them in a file for further processing.
Further processing would be what?
Did you read the remainder of what I wrote?

May 27 '07 #16
John, thanks for your reply. I will then use the files as input to
generate an index. So the files are temporary, and provide some
attributes in the index. I do this multiple times to gather different
attributes, merge, etc.

May 27 '07 #17

This discussion thread is closed

Replies have been disabled for this discussion.
