Re: Tremendous slowdown due to garbage collection

On Apr 12, 7:02 am, andreas.eis...@gmail.com wrote:
I would suggest configuring the default behaviour of the garbage collector in such a way that this quadratic complexity is avoided without requiring specific knowledge and intervention by the user. Not being an expert in these details, I would like to ask the gurus how this could be done.
Well, the garbage collector activates whenever allocations exceed deallocations by a certain threshold, and that is exactly the source of the bottleneck: while you build a huge structure almost nothing is deallocated, so the collector keeps firing, traverses every live object each time, and frees essentially nothing. The total work therefore grows quadratically with the number of objects.

My first stab at a suggestion would be to adjust the threshold depending on how successful each collection run is. So if a garbage collection run collects no objects, double (or whatever) the threshold until the next run.

More formulaically: if your threshold is X, and a garbage collection pass collects Y objects, then the threshold for the next pass would be something like 2*X/(Y/X + 1). Then you can create your very large data structure in only O(N log N) time.
But as Steve Holden said, it'll probably mess someone else up.
There's not much you can do with a garbage collector to make it please
everyone.
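(A user-level sketch of that adaptive rule, assuming CPython's gc module; build_step, n_steps, and the starting threshold are hypothetical, and this tunes explicit passes rather than changing the interpreter's built-in scheduling:)

    import gc

    def build_adaptively(build_step, n_steps, threshold=10000):
        # Sketch of the rule above: run an explicit collection every
        # `threshold` allocations.  If a pass at threshold X frees Y
        # objects, the next threshold is 2*X/(Y/X + 1): a fruitless
        # pass (Y == 0) doubles the interval, a productive one shrinks it.
        gc.disable()              # take scheduling away from the automatic gc
        out = []
        since_last = 0
        try:
            for i in range(n_steps):
                out.append(build_step(i))
                since_last += 1
                if since_last >= threshold:
                    freed = gc.collect()    # Y: unreachable objects found
                    threshold = max(1, int(2 * threshold / (freed / threshold + 1)))
                    since_last = 0
        finally:
            gc.enable()
        return out

The same idea could in principle be fed into gc.set_threshold() to steer the automatic collector instead of disabling it.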
A question often asked--and I am not a big fan of these sorts of questions, but it is worth thinking about--of people who are creating very large data structures in Python is "Why are you doing that?" That is, you should consider whether some kind of database solution would be better. You mention lots of dicts--it sounds like some balanced B-trees with disk loading on demand could be a good choice.
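(As a hedged illustration of the disk-loading-on-demand idea--not Carl's specific B-tree suggestion--the standard library's shelve module gives a dict-like mapping backed by a dbm file, so values live on disk and are loaded per key; the filename is illustrative:)

    import shelve

    # shelve stores a dict-like mapping in a dbm database on disk,
    # loading values on demand instead of holding everything in RAM.
    with shelve.open("big_index") as index:
        index["alpha"] = [1, 2, 3]           # values are pickled transparently
        index["beta"] = {"nested": "mapping"}

    with shelve.open("big_index") as index:
        print(index["alpha"])                # read back on demand -> [1, 2, 3]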
Carl Banks
Jun 27 '08 #1
A question often asked--and I am not a big fan of these sorts of questions, but it is worth thinking about--of people who are creating very large data structures in Python is "Why are you doing that?" That is, you should consider whether some kind of database solution would be better. You mention lots of dicts--it sounds like some balanced B-trees with disk loading on demand could be a good choice.
Well, probably because you can get better than 100x improved performance if you don't involve the disk and use clever in-memory indexing.

BTW, I think the default behaviour of the gc is pretty silly. These days I tend to disable automatic gc and explicitly put in collections when I know I'm done with some big operation.
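(A minimal sketch of that discipline; generate_items is a hypothetical producer of key/value pairs:)

    import gc

    gc.disable()          # no automatic passes while the big structure is built
    try:
        index = {key: value for key, value in generate_items()}
    finally:
        gc.enable()
        gc.collect()      # one explicit pass once the big operation is done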
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydis...TEXT=dumb+slow
Jun 27 '08 #2
On Apr 14, 4:27 pm, Aaron Watters <aaron.watt...@gmail.com> wrote:
A question often asked--and I am not a big fan of these sorts of questions, but it is worth thinking about--of people who are creating very large data structures in Python is "Why are you doing that?" That is, you should consider whether some kind of database solution would be better. You mention lots of dicts--it sounds like some balanced B-trees with disk loading on demand could be a good choice.

Well, probably because you can get better than 100x improved performance if you don't involve the disk and use clever in-memory indexing.

Are you sure it won't involve disk use? I'm just throwing this out there, but if you're creating a structure of hundreds of megabytes in memory, there's a chance the OS will swap it out to disk, which defeats any improvement in latency you would have gotten.

However, that is for the OP to decide. The reason I don't like the sort of question I posed is that it's presumptuous--maybe the OP already considered and rejected this, and has taken steps to ensure the in-memory data structure won't be swapped--but a database solution should at least be considered here.
Carl Banks
Jun 27 '08 #3
On Apr 14, 11:18 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
However, that is for the OP to decide. The reason I don't like the sort of question I posed is that it's presumptuous--maybe the OP already considered and rejected this, and has taken steps to ensure the in-memory data structure won't be swapped--but a database solution should at least be considered here.
Yes, you are right, especially if the index structure will be needed many times over a long period of time. Even here, though, these days you can go pretty far by loading everything into core (streaming from disk) and dumping everything out when you are done, if needed (ahem, using the preferred way to do this from Python for speed and safety: marshal ;) ).
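(For concreteness, a sketch assuming the index consists of marshal-friendly built-in types; the filename and the stand-in index are illustrative:)

    import marshal

    # marshal handles built-in types (dicts, lists, strings, numbers, ...)
    # quickly, though its format is CPython-version-specific.
    index = {"word%d" % i: list(range(5)) for i in range(1000)}  # stand-in index

    with open("index.dat", "wb") as f:
        marshal.dump(index, f)      # dump the whole structure in one shot

    with open("index.dat", "rb") as f:
        index = marshal.load(f)     # stream it back into core later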

Even with B-trees, if you jump around in the tree the performance can be awful. This is why Nucular, for example, is designed to stream results sequentially from disk whenever possible. The one place where it doesn't do this very well (proximity searches) shows the most problems with performance (under bad circumstances, like searching for two common words in proximity).
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydis...?FREETEXT=joys
Jun 27 '08 #4
Aaron Watters <aa***********@gmail.com> writes:
Even with Btree's if you jump around in the tree the performance can
be awful.
The Linux file cache really helps. The simplest approach is to just "cat" the index files to /dev/null a few times an hour. Slightly faster (what I do with Solr) is to mmap the files into memory and read a byte from each page now and then. Assuming (as in Lucene) that the index file format is compressed, this approach is far more RAM-efficient than actually unpacking the index into data structures, though of course you take the overhead (a few microseconds) of a couple of system calls at each access to the index, even when it's all in cache.
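(A minimal sketch of that page-touching trick, assuming Python's standard mmap module; the path and schedule are hypothetical:)

    import mmap
    import os

    def keep_warm(path):
        # Touch one byte per page so the kernel keeps the file in the page cache.
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            if size == 0:
                return
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                for offset in range(0, size, mmap.PAGESIZE):
                    mm[offset]      # reading faults the page in (or refreshes it)

    keep_warm("index/segment.dat")  # illustrative path; run this periodically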
Jun 27 '08 #5

