Large Amount of Data

I need to process a large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
May 25 '07 #1
On May 25, 10:50 am, "Jack" <nos...@invalid.com> wrote:
I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
The OS will take care of memory swapping. It might get slow, but I
don't think it should fail.

Matt

May 25 '07 #2
In <H8******************************@comcast.com>, Jack wrote:
I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?
What about putting the data into a database? If the keys are strings the
`shelve` module might be a solution.
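A minimal sketch of how that could look (untested; the file name and record
layout are made up for illustration):

import shelve

# A shelf keeps the mapping on disk and pickles the values, so only the
# entries you actually touch are held in memory at any one time.
db = shelve.open('docs.db')
try:
    db['doc-0001'] = {'size': 1234, 'merged': False}   # used like a dict
    if 'doc-0001' in db:
        props = db['doc-0001']
        props['merged'] = True
        db['doc-0001'] = props    # reassign so the change is written back
finally:
    db.close()

If you mutate values in place, either reassign them as above or open the
shelf with writeback=True, otherwise the modification is silently lost.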

Ciao,
Marc 'BlackJack' Rintsch
May 25 '07 #3
Thanks for the replies!

Database will be too slow for what I want to do.

"Marc 'BlackJack' Rintsch" <bj****@gmx.netwrote in message
news:pa****************************@gmx.net...
In <H8******************************@comcast.com>, Jack wrote:
>I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

What about putting the data into a database? If the keys are strings the
`shelve` module might be a solution.

Ciao,
Marc 'BlackJack' Rintsch

May 25 '07 #4
Jack wrote:
Thanks for the replies!

Database will be too slow for what I want to do.

"Marc 'BlackJack' Rintsch" <bj****@gmx.netwrote in message
news:pa****************************@gmx.net...
>In <H8******************************@comcast.com>, Jack wrote:
>>I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?
What about putting the data into a database? If the keys are strings the
`shelve` module might be a solution.

Ciao,
Marc 'BlackJack' Rintsch

Purchase more memory. It is REALLY cheap these days.

-Larry
May 25 '07 #5
On 5/25/07, Jack <no****@invalid.com> wrote:
I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
--
http://mail.python.org/mailman/listinfo/python-list
Could you process it in chunks, instead of reading in all the data at once?
May 26 '07 #6
Larry Bates wrote:
Jack wrote:
>Thanks for the replies!

Database will be too slow for what I want to do.

"Marc 'BlackJack' Rintsch" <bj****@gmx.netwrote in message
news:pa****************************@gmx.net...
>>In <H8******************************@comcast.com>, Jack wrote:

I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?
What about putting the data into a database? If the keys are strings the
`shelve` module might be a solution.

Ciao,
Marc 'BlackJack' Rintsch
Purchase more memory. It is REALLY cheap these days.
Not a solution at all. What if the amount of data exceeds the architecture's
memory limits, i.e. 4 GB on 32-bit?

A better solution is to use a database for data storage/processing.

--
Vyacheslav Maslov
May 26 '07 #7
Jack wrote:
I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
What are you trying to do? At one extreme, you're implementing something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and need to talk seriously
about 64-bit address space machines. At the other, you have no idea how
to either use a database or do sequential processing. Tell us more.

John Nagle
May 26 '07 #8
I have tens of millions (could be more) of documents in files. Each of them
has other properties in separate files. I need to check if they exist,
update and merge properties, etc. And this is not a one-time job. Because
of the quantity of the files, I think querying and updating a database will
take a long time...

Let's say I want to do something a search engine needs to do in terms of
the amount of data to be processed on a server. I doubt any serious search
engine would use a database for indexing and searching. A hash table is
what I need, not powerful queries.

"John Nagle" <na***@animats.comwrote in message
news:nf*****************@newssvr23.news.prodigy.ne t...
Jack wrote:
>I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.

What are you trying to do? At one extreme, you're implementing
something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and need to talk seriously
about 64-bit address space machines. At the other, you have no idea how
to either use a database or do sequential processing. Tell us more.

John Nagle

May 26 '07 #9
I suppose I can, but it won't be very efficient. I can have a smaller hash
table, process the records that are in it, and save the ones that are not
for another round of processing. But a chunked hash table won't work that
well on its own, because you don't know whether a record exists in another
chunk. To make this work, I'll need a rule to partition the data into
chunks. So this is more work in general.
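The rule I have in mind would be something like hashing each key into a
fixed number of bucket files, so every record for a given key ends up in
the same bucket and each bucket can then be processed on its own. A rough,
untested sketch (the bucket count, directory and file names are arbitrary):

import os
import zlib

NUM_BUCKETS = 64    # pick a count so that one bucket's dictionary fits in RAM

def bucket_path(key, workdir='buckets'):
    # crc32 is stable across runs (unlike hash() on newer Pythons), so the
    # same key always lands in the same bucket file even across input chunks.
    return os.path.join(workdir, 'bucket-%02d.txt' % (zlib.crc32(key) % NUM_BUCKETS))

def partition(records, workdir='buckets'):
    # records is any iterable of (key, value) pairs read from the input files;
    # keys are assumed not to contain tabs or newlines.
    if not os.path.isdir(workdir):
        os.makedirs(workdir)
    for key, value in records:
        f = open(bucket_path(key, workdir), 'a')
        f.write('%s\t%s\n' % (key, value))
        f.close()

def process_buckets(workdir='buckets'):
    for name in sorted(os.listdir(workdir)):
        table = {}                          # only one bucket in memory at a time
        for line in open(os.path.join(workdir, name)):
            key, value = line.rstrip('\n').split('\t', 1)
            table.setdefault(key, []).append(value)
        # ... check existence, merge properties, etc. against table here ...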

"kaens" <ap***************@gmail.comwrote in message
news:ma***************************************@pyt hon.org...
On 5/25/07, Jack <no****@invalid.com> wrote:
>I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
--
http://mail.python.org/mailman/listinfo/python-list

Could you process it in chunks, instead of reading in all the data at
once?

May 26 '07 #10
If swap memory cannot handle this efficiently, I may need to partition the
data across multiple servers and use RPC to communicate.
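Roughly, each server would own a slice of the key space and expose get/put
calls. A very rough, untested sketch using the standard library XML-RPC
modules (host names, port and function names are all made up):

# --- server side: owns one slice of the key space ---
import SimpleXMLRPCServer                    # xmlrpc.server in Python 3

store = {}                                   # this node's slice of the big dictionary

def get(key):
    return store.get(key)

def put(key, value):
    store[key] = value
    return True

server = SimpleXMLRPCServer.SimpleXMLRPCServer(('0.0.0.0', 8000), allow_none=True)
server.register_function(get)
server.register_function(put)
# server.serve_forever()                     # uncomment to actually run the node

# --- client side: route each key to the node that owns it ---
import xmlrpclib                             # xmlrpc.client in Python 3
import zlib

SERVERS = ['http://node1:8000', 'http://node2:8000']    # made-up host names

def proxy_for(key):
    # crc32 gives a routing hash that is stable across processes
    return xmlrpclib.ServerProxy(SERVERS[zlib.crc32(key) % len(SERVERS)],
                                 allow_none=True)

# proxy_for('doc-0001').put('doc-0001', 'some properties')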

"Dennis Lee Bieber" <wl*****@ix.netcom.comwrote in message
news:YY******************@newsread1.news.pas.earth link.net...
On Fri, 25 May 2007 11:11:28 -0700, "Jack" <no****@invalid.com>
declaimed the following in comp.lang.python:
>Thanks for the replies!

Database will be too slow for what I want to do.
Slower than having every process on the computer potentially slowed
down due to page swapping (and, for really huge data, still running the
risk of exceeding the single-process address space)?
--
Wulfraed Dennis Lee Bieber KD6MOG
wl*****@ix.netcom.com wu******@bestiaria.com
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: we******@bestiaria.com)
HTTP://www.bestiaria.com/

May 26 '07 #11
In <da******************************@comcast.com>, Jack wrote:
I have tens of millions (could be more) of document in files. Each of them
has other properties in separate files. I need to check if they exist,
update and merge properties, etc.
And this is not a one time job. Because of the quantity of the files, I
think querying and updating a database will take a long time...
But databases are built and optimized exactly to handle large amounts of
data.
Let's say, I want to do something a search engine needs to do in terms
of the amount of data to be processed on a server. I doubt any serious
search engine would use a database for indexing and searching. A hash
table is what I need, not powerful queries.
You are not forced to use complex queries and an index is much like a hash
table, often even implemented as a hash table. And a database doesn't
have to be an SQL database. The `shelve` module or an object DB like zodb
or Durus are databases too.

Maybe you should try it and measure before claiming it's going to be too
slow and spending time implementing something like a database yourself.
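Even a crude timing run tells you a lot before you commit to an
architecture. A sketch of what I mean (untested; the record count and key
format are arbitrary):

import shelve
import time

N = 100000                               # scale this toward your real data size

db = shelve.open('bench.db')             # file name is made up
start = time.time()
for i in xrange(N):                      # use range() on Python 3
    db['key-%09d' % i] = {'exists': True, 'props': i}
db.close()
print('wrote %d records in %.1f s' % (N, time.time() - start))

db = shelve.open('bench.db')
start = time.time()
found = 0
for i in xrange(0, N, 7):                # sample some existence checks
    if ('key-%09d' % i) in db:
        found += 1
db.close()
print('%d sample lookups in %.1f s' % (found, time.time() - start))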

Ciao,
Marc 'BlackJack' Rintsch
May 26 '07 #12
Jack wrote:
"John Nagle" <na***@animats.comwrote in message
news:nf*****************@newssvr23.news.prodigy.ne t...
>Jack wrote:
>>I need to process large amount of data. The data structure fits well
in a dictionary but the amount is large - close to or more than the size
of physical memory. I wonder what will happen if I try to load the data
into a dictionary. Will Python use swap memory or will it fail?

Thanks.
What are you trying to do? At one extreme, you're implementing
something
like a search engine that needs gigabytes of bitmaps to do joins fast as
hundreds of thousands of users hit the server, and need to talk seriously
about 64-bit address space machines. At the other, you have no idea how
to either use a database or do sequential processing. Tell us more.
I have tens of millions (could be more) of documents in files. Each of
them has other properties in separate files. I need to check if they
exist, update and merge properties, etc. And this is not a one-time job.
Because of the quantity of the files, I think querying and updating a
database will take a long time...
And I think you are wrong. But of course the only way to find out who's
right and who's wrong is to do some experiments and get some benchmark
timings.

All I *would* say is that it's unwise to proceed with a memory-only
architecture when you only have assumptions about the limitations of
particular architectures, and your problem might actually grow to exceed
the memory limits of a 32-bit architecture anyway.

Swapping might, depending on access patterns, cause your performance to
take a real nose-dive. Then where do you go? Much better to architect
the application so that you anticipate exceeding memory limits from the
start, I'd hazard.
Let's say I want to do something a search engine needs to do in terms of
the amount of data to be processed on a server. I doubt any serious
search engine would use a database for indexing and searching. A hash
table is what I need, not powerful queries.
You might be surprised. Google, for example, use a widely-distributed
and highly-redundant storage format, but they certainly don't keep the
whole Internet in memory :-)

Perhaps you need to explain the problem in more detail if you still need
help.

regards
Steve
--
Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
------------------ Asciimercial ---------------------
Get on the web: Blog, lens and tag your way to fame!!
holdenweb.blogspot.com squidoo.com/pythonology
tagged items: del.icio.us/steve.holden/python
All these services currently offer free registration!
-------------- Thank You for Reading ----------------

May 26 '07 #13
On May 26, 6:17 pm, "Jack" <nos...@invalid.com> wrote:
I have tens of millions (could be more) of documents in files. Each of them
has other properties in separate files. I need to check if they exist,
update and merge properties, etc.
And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?

And this is not a one-time job. Because of the quantity of the files, I
think querying and updating a database will take a long time...
Don't think, benchmark.
>
Let's say I want to do something a search engine needs to do in terms of
the amount of data to be processed on a server. I doubt any serious search
engine would use a database for indexing and searching. A hash table is
what I need, not powerful queries.
Having a single hash table permits two not very powerful query
methods: (1) return the data associated with a single hash key (2)
trawl through the whole hash table, applying various conditions to the
data. If that is all you want, then comparisons with a serious search
engine are quite irrelevant.

What is relevant is that the whole hash table has to be in virtual memory
before you can start either type of query. This is not the case with a
database. Type 1 queries (with a suitable index on the primary key)
should use only a fraction of the memory that a full hash table would.
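For instance, with SQLite (shipped with Python 2.5 as the sqlite3 module) a
type 1 lookup only reads the index and data pages it needs, not the whole
data set. A sketch (table and column names are invented):

import sqlite3

conn = sqlite3.connect('docs.sqlite')          # lives on disk, not in RAM
conn.execute('''CREATE TABLE IF NOT EXISTS docs
                (doc_id TEXT PRIMARY KEY,      -- the primary key is indexed
                 props  TEXT)''')

# type 1 query: fetch the record for a single key
row = conn.execute('SELECT props FROM docs WHERE doc_id = ?',
                   ('doc-0001',)).fetchone()

# update/merge one record without loading anything else
conn.execute('INSERT OR REPLACE INTO docs (doc_id, props) VALUES (?, ?)',
             ('doc-0001', 'merged properties here'))
conn.commit()
conn.close()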

What is the primary key of your data?

May 27 '07 #14
I'll save them in a file for further processing.

"John Machin" <sj******@lexicon.netwrote in message
news:11*********************@q19g2000prn.googlegro ups.com...
On May 26, 6:17 pm, "Jack" <nos...@invalid.comwrote:
>I have tens of millions (could be more) of documents in files. Each of
them has other properties in separate files. I need to check if they
exist, update and merge properties, etc.

And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?

May 27 '07 #15
On May 27, 11:24 am, "Jack" <nos...@invalid.com> wrote:
I'll save them in a file for further processing.
Further processing would be what?
Did you read the remainder of what I wrote?

May 27 '07 #16
John, thanks for your reply. I will then use the files as input to generate
an index. So the files are temporary, and provide some attributes in the
index. I do this multiple times to gather different attributes, merge, etc.

"John Machin" <sj******@lexicon.netwrote in message
news:11**********************@o11g2000prd.googlegr oups.com...
On May 27, 11:24 am, "Jack" <nos...@invalid.comwrote:
>I'll save them in a file for further processing.

Further processing would be what?
Did you read the remainder of what I wrote?

May 27 '07 #17
