Bytes | Developer Community

Large Amount of Data

I need to process a large amount of data. The data structure fits well
in a dictionary, but the amount is large: close to, or more than, the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory, or will it fail?

Thanks.
May 25 '07 #1
On May 25, 10:50 am, "Jack" <nos...@invalid.com> wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?
The OS will take care of memory swapping. It might get slow, but I
don't think it should fail.

Matt

May 25 '07 #2
In <H8******************************@comcast.com>, Jack wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?

What about putting the data into a database? If the keys are strings, the
`shelve` module might be a solution.
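[A sketch of that suggestion, for reference: `shelve` behaves like a dict whose values are pickled to disk, so records are read on demand instead of held in RAM. The filename and keys below are invented for illustration.]

```python
import os
import shelve
import tempfile

# Hypothetical on-disk index; shelve.open gives a persistent mapping.
path = os.path.join(tempfile.mkdtemp(), "index.db")

with shelve.open(path) as db:
    db["doc-00001"] = {"size": 1024, "merged": False}  # value is pickled

# Reopening reads records from disk on demand, not all at once.
with shelve.open(path) as db:
    rec = db["doc-00001"]
    rec["merged"] = True
    db["doc-00001"] = rec  # write back: mutating rec alone isn't persisted

with shelve.open(path) as db:
    result = db["doc-00001"]
```

Note the explicit write-back: by default `shelve` does not notice in-place mutation of a stored value.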

Ciao,
Marc 'BlackJack' Rintsch
May 25 '07 #3
Thanks for the replies!

Database will be too slow for what I want to do.


May 25 '07 #4
Jack wrote:
> Thanks for the replies!
>
> Database will be too slow for what I want to do.
Purchase more memory. It is REALLY cheap these days.

-Larry
May 25 '07 #5
On 5/25/07, Jack <no****@invalid.com> wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?
Could you process it in chunks, instead of reading in all the data at once?
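[A minimal sketch of that idea, with invented records and a made-up handler: build a dict no larger than a fixed bound, process it, discard it, and continue, so memory use is proportional to the chunk size rather than to the whole input.]

```python
def process_in_chunks(records, handle, chunk_size):
    """Accumulate at most chunk_size items in a dict, pass the dict to
    handle(), then discard it and keep reading."""
    chunk, out = {}, []
    for key, value in records:
        chunk[key] = value
        if len(chunk) >= chunk_size:
            out.append(handle(chunk))
            chunk = {}
    if chunk:  # flush the final partial chunk
        out.append(handle(chunk))
    return out

# Hypothetical usage: sum values per chunk instead of loading everything.
totals = process_in_chunks(
    [("a", 1), ("b", 2), ("c", 3)],
    handle=lambda d: sum(d.values()),
    chunk_size=2,
)
```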
May 26 '07 #6
Larry Bates wrote:
> Purchase more memory. It is REALLY cheap these days.
Not a solution at all. What if the amount of data exceeds the
architecture's memory limits, i.e. 4 GB on 32-bit?

A better solution is to use a database for data storage/processing.

--
Vyacheslav Maslov
May 26 '07 #7
Jack wrote:
> I need to process a large amount of data. The data structure fits well
> in a dictionary, but the amount is large: close to, or more than, the
> size of physical memory. I wonder what will happen if I try to load the
> data into a dictionary. Will Python use swap memory, or will it fail?

What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and you need to talk
seriously about 64-bit address-space machines. At the other, you have no
idea how to either use a database or do sequential processing. Tell us
more.

John Nagle
May 26 '07 #8
I have tens of millions (could be more) of documents in files. Each of
them has other properties in separate files. I need to check if they
exist, update and merge properties, etc. And this is not a one-time job.
Because of the quantity of the files, I think querying and updating a
database will take a long time...

Let's say I want to do something a search engine needs to do, in terms
of the amount of data to be processed on a server. I doubt any serious
search engine would use a database for indexing and searching. A hash
table is what I need, not powerful queries.

May 26 '07 #9
I suppose I can, but it won't be very efficient. I can have a smaller
hash table, process the items that are in it, and save the ones that
are not for another round of processing. But a chunked hash table won't
work that well, because you don't know whether a key exists in other
chunks. To do this, I'll need a rule to partition the data into chunks.
So this is more work in general.
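[One common partition rule, sketched here with invented names: route each key through a stable hash, so records that must be compared or merged always land in the same chunk and no cross-chunk lookups are needed.]

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical partition count


def partition_of(key):
    # Use a stable digest (not Python's randomized hash()) so the same
    # key maps to the same partition across runs and across machines.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

Process one partition's hash table at a time; anything that refers to the same key is guaranteed to be in the same round.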

May 26 '07 #10
If swap memory cannot handle this efficiently, I may need to partition
the data across multiple servers and use RPC to communicate.

"Dennis Lee Bieber" <wl*****@ix.netcom.com> wrote in message
news:YY******************@newsread1.news.pas.earthlink.net...
> On Fri, 25 May 2007 11:11:28 -0700, "Jack" <no****@invalid.com>
> declaimed the following in comp.lang.python:
> > Database will be too slow for what I want to do.
>
> Slower than having every process on the computer potentially slowed
> down due to page swapping (and, for really huge data, still running
> the risk of exceeding the single-process address space)?

May 26 '07 #11
In <da******************************@comcast.com>, Jack wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...

But databases are exactly what is built and optimized to handle large
amounts of data.

> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.

You are not forced to use complex queries, and an index is much like a
hash table, often even implemented as one. And a database doesn't have
to be an SQL database. The `shelve` module, or an object DB like ZODB or
Durus, are databases too.

Maybe you should try it and measure before claiming it's going to be too
slow and spending the time to implement something like a database
yourself.
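[A rough sketch of such a measurement: time the same lookups against an in-memory dict and a `shelve` file. The sizes and filename are invented and far smaller than the real workload; scale `n` up toward the real data before drawing conclusions.]

```python
import os
import shelve
import tempfile
import time

n = 1000  # hypothetical; the real workload is tens of millions
keys = ["doc-%06d" % i for i in range(n)]
data = {k: i for i, k in enumerate(keys)}

# Populate an on-disk shelve with the same records.
path = os.path.join(tempfile.mkdtemp(), "bench.db")
with shelve.open(path) as db:
    for k, v in data.items():
        db[k] = v

# Time identical lookup loops over both stores.
t0 = time.perf_counter()
dict_total = sum(data[k] for k in keys)
dict_secs = time.perf_counter() - t0

with shelve.open(path) as db:
    t0 = time.perf_counter()
    shelve_total = sum(db[k] for k in keys)
    shelve_secs = time.perf_counter() - t0
```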

Ciao,
Marc 'BlackJack' Rintsch
May 26 '07 #12
Jack wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...
And I think you are wrong. But of course the only way to find out who's
right and who's wrong is to do some experiments and get some benchmark
timings.

All I *would* say is that it's unwise to proceed with a memory-only
architecture when you only have assumptions about the limitations of
particular architectures, and your problem might actually grow to exceed
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to
take a real nose-dive. Then where do you go? Much better to architect
the application so that you anticipate exceeding memory limits from the
start, I'd hazard.
> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
You might be surprised. Google, for example, use a widely-distributed
and highly-redundant storage format, but they certainly don't keep the
whole Internet in memory :-)

Perhaps you need to explain the problem in more detail if you still need
help.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden

May 26 '07 #13
On May 26, 6:17 pm, "Jack" <nos...@invalid.com> wrote:
> I have tens of millions (could be more) of documents in files. Each of
> them has other properties in separate files. I need to check if they
> exist, update and merge properties, etc.
And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?

> And this is not a one-time job. Because of the quantity of the files,
> I think querying and updating a database will take a long time...
Don't think, benchmark.
> Let's say I want to do something a search engine needs to do, in terms
> of the amount of data to be processed on a server. I doubt any serious
> search engine would use a database for indexing and searching. A hash
> table is what I need, not powerful queries.
Having a single hash table permits two not-very-powerful query methods:
(1) return the data associated with a single hash key; (2) trawl through
the whole hash table, applying various conditions to the data. If that
is all you want, then comparisons with a serious search engine are quite
irrelevant.

What is relevant is that the whole hash table has to be in virtual
memory before you can start either type of query. This is not the case
with a database. Type 1 queries (with a suitable index on the primary
key) should use only a fraction of the memory that a full hash table
would.

What is the primary key of your data?
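[A small sketch of such a type-1 lookup with the stdlib's `sqlite3`; the table, columns, and values are invented. The `PRIMARY KEY` gives an index, so the query is answered without scanning or loading the whole table.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for on-disk data
conn.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, props TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("doc-1", "size=10"), ("doc-2", "size=20")],
)

# Type 1 query: fetch one record by primary key via the index.
row = conn.execute(
    "SELECT props FROM docs WHERE doc_id = ?", ("doc-2",)
).fetchone()
conn.close()
```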

May 27 '07 #14
I'll save them in a file for further processing.


May 27 '07 #15
On May 27, 11:24 am, "Jack" <nos...@invalid.com> wrote:
> I'll save them in a file for further processing.
Further processing would be what?
Did you read the remainder of what I wrote?

May 27 '07 #16
John, thanks for your reply. I will then use the files as input to
generate an index. So the files are temporary, and provide some
attributes in the index. I do this multiple times to gather different
attributes, merge, etc.

May 27 '07 #17

This discussion thread is closed

Replies have been disabled for this discussion.
