
Re: Tremendous slowdown due to garbage collection

On Apr 12, 7:02 am, andreas.eis...@gmail.com wrote:
I would suggest configuring the default behaviour of the garbage
collector so that this quadratic complexity is avoided without
requiring specific knowledge and intervention by the user. Not being
an expert in these details, I would like to ask the gurus how this
could be done.
Well, the garbage collector activates whenever allocations exceed
deallocations by a certain amount, which for obvious reasons is the
source of the bottleneck here: building a huge structure means the
allocations vastly outnumber the deallocations, so collection passes
keep firing, and each pass has to traverse everything allocated so far.

My first stab at a suggestion would be to adjust the threshold
depending on how successful it is. So if a garbage collection run
collects no objects, double (or whatever) the threshold until the next
run.

More formulaically, if your threshold is X, and a garbage collection
pass collects Y objects, then the threshold for the next pass would be
something like 2*X/(Y/X + 1). Then you can create your very large data
structure in only O(N log N) time.
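
A user-level sketch of that idea (purely illustrative -- this is not
how CPython's collector is tuned, and the AdaptiveGC name, the 700
floor, and the per-insert bookkeeping are made up for the example):

    import gc

    class AdaptiveGC:
        """Drive collections by hand, growing the threshold when passes find nothing."""

        def __init__(self, threshold=700):
            gc.disable()                 # take over from the automatic collector
            self.threshold = threshold
            self.allocations = 0

        def note_allocation(self, n=1):
            # Call this as objects are added (once per insert, or per batch).
            self.allocations += n
            if self.allocations >= self.threshold:
                collected = gc.collect()   # number of unreachable objects found
                x, y = self.threshold, collected
                # 2*X/(Y/X + 1): doubles when nothing was collected,
                # shrinks back toward the floor when a lot was.
                self.threshold = max(700, int(2 * x / (y / x + 1)))
                self.allocations = 0
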
But as Steve Holden said, it'll probably mess someone else up.
There's not much you can do with a garbage collector to make it please
everyone.
A question often asked--and I am not a big fan of these sorts of
questions, but it is worth thinking about--of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.
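
(Illustration added for readers, not from the original post: the
stdlib shelve module isn't a B-tree, but it shows the "disk-backed
mapping, loaded on demand" shape with a dict-like API; the file name
is arbitrary.)

    import shelve

    # A disk-backed mapping in place of a giant in-memory dict.
    with shelve.open("big_index.db") as index:
        index["some key"] = {"count": 3, "positions": [10, 42, 977]}
        # Only the entries you actually touch are read back from disk.
        print(index.get("some key"))
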
Carl Banks
Jun 27 '08 #1


A question often asked--and I am not a big fan of these sorts of
questions, but it is worth thinking about--of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.
Well, probably because you can get better than 100x improved
performance if you don't involve the disk and use clever in-memory
indexing. BTW, I think the default behaviour of the gc is pretty
silly. These days I tend to disable automatic gc and explicitly put in
collections when I know I'm done with some big operation.
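
Something along these lines (an illustrative sketch, not code from the
post; gc_paused is a made-up helper):

    import gc
    from contextlib import contextmanager

    @contextmanager
    def gc_paused():
        """Disable the cyclic collector while a big structure is being built."""
        was_enabled = gc.isenabled()
        gc.disable()
        try:
            yield
        finally:
            if was_enabled:
                gc.enable()
            gc.collect()     # one explicit pass once the big operation is done

    # with gc_paused():
    #     table = {i: str(i) for i in range(10_000_000)}
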
-- Aaron Watters

===
http://www.xfeedme.com/nucular/pydis...TEXT=dumb+slow
Jun 27 '08 #2

P: n/a
On Apr 14, 4:27 pm, Aaron Watters <aaron.watt...@gmail.com> wrote:
A question often asked--and I am not a big fan of these sorts of
questions, but it is worth thinking about--of people who are creating
very large data structures in Python is "Why are you doing that?"
That is, you should consider whether some kind of database solution
would be better. You mention lots of dicts--it sounds like some
balanced B-trees with disk loading on demand could be a good choice.

Well, probably because you can get better than 100x improved
performance if you don't involve the disk and use clever in-memory
indexing.
Are you sure it won't involve disk use? I'm just throwing this out
there, but if you're creating a structure that takes hundreds of
megabytes of memory, there's a chance the OS will swap it out to disk,
which defeats any improvement in latency you would have gotten.

However, that is for the OP to decide. The reason I don't like the
sort of question I posed is that it's presumptuous--maybe the OP has
already considered and rejected this, and has taken steps to ensure
the in-memory data structure won't be swapped--but a database solution
should at least be considered here.
Carl Banks
Jun 27 '08 #3

On Apr 14, 11:18 pm, Carl Banks <pavlovevide...@gmail.com> wrote:
However, that is for the OP to decide. The reason I don't like the
sort of question I posed is that it's presumptuous--maybe the OP has
already considered and rejected this, and has taken steps to ensure
the in-memory data structure won't be swapped--but a database solution
should at least be considered here.
Yes, you are right, especially if the index structure will be needed
many times over a long period of time. Even here, though, these days
you can go pretty far by loading everything into core (streaming from
disk) and dumping everything out when you are done, if needed (ahem,
using the preferred way to do this from Python for speed and safety:
marshal ;) ).
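
(A minimal sketch of that dump/load round trip, added for
illustration; dump_index and load_index are made-up names, and marshal
only handles Python's core builtin types and makes no cross-version
guarantees about its file format.)

    import marshal

    def dump_index(index, path):
        # index should consist of plain dicts/lists/tuples/strings/numbers.
        with open(path, "wb") as f:
            marshal.dump(index, f)

    def load_index(path):
        with open(path, "rb") as f:
            return marshal.load(f)
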

Even with B-trees, if you jump around in the tree the performance can
be awful. This is why Nucular, for example, is designed to stream
results sequentially from disk whenever possible. The one place where
it doesn't do this very well (proximity searches) shows the most
performance problems (under bad circumstances, like searching for two
common words in proximity).
-- Aaron Watters
===
http://www.xfeedme.com/nucular/pydis...?FREETEXT=joys
Jun 27 '08 #4

Aaron Watters <aa***********@gmail.com> writes:
Even with B-trees, if you jump around in the tree the performance can
be awful.
The Linux file cache really helps. The simplest approach is to just
"cat" the index files to /dev/null a few times an hour. Slightly
faster (what I do with Solr) is to mmap the files into memory and read
a byte from each page now and then. Assuming (as in Lucene) that the
index file format is compressed, this approach is far more
RAM-efficient than actually unpacking the index into data structures,
though of course you take the overhead (a few microseconds) of a
couple of system calls at each access to the index even when it's all
in cache.
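
(Roughly, in Python rather than Java, and purely as an illustration --
warm_index is a made-up helper for the page-touching trick described
above.)

    import mmap
    import os

    def warm_index(path):
        """Touch one byte per page so the OS keeps the file in its page cache."""
        size = os.path.getsize(path)
        if size == 0:
            return
        with open(path, "rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                for offset in range(0, size, mmap.PAGESIZE):
                    mm[offset]          # the read fault pulls the page into cache
            finally:
                mm.close()
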
Jun 27 '08 #5
