Bytes IT Community

Questions about bsddb

Hello,

I need to build a large database that has roughly 500,000 keys and a
variable amount of data for each key. The data for each key could
range from 100 bytes to megabytes, and the data under each key will
grow over time as the database is being built. Are there some flags I
should be setting when opening the database to handle large amounts of
data per key? Is hash or binary tree recommended for this type of job?
I'll be building the database from scratch, so there will be lots of
lookups and appending of data. Testing is showing btree to be faster,
so I'm leaning towards that. The estimated build time is around 10-12
hours on my machine, so I want to make sure that something won't get
messed up in the 10th hour.

TIA,
JM

May 9 '07 #1


On May 9, 8:23 am, sinoo...@yahoo.com wrote:
JM,

How will you access your data?
If you access the keys often in a sequential manner, then btree is
better.

In general, the rule is:

1) For small data sets, either one works.
2) For larger data sets, use btree. btree is also good for sequential
key access.
3) For really huge data sets, where even the metadata of the btree
cannot fit in the cache, hash will be better. The reasoning is that
since the metadata is larger than the cache, there will be at least
one I/O operation either way, but with a btree there might be multiple
I/Os just to find the key, because the tree is not all in memory and
has multiple levels.
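As a runnable sketch of the access pattern being discussed here (this uses the stdlib dbm module rather than bsddb itself, and "example_store" is just a made-up filename), appending data under an existing key is a read-modify-write cycle:

```python
import dbm

# Open (or create) a simple on-disk key/value store; "c" creates it if missing.
with dbm.open("example_store", "c") as db:
    db[b"key1"] = b"first chunk"
    # Growing the data under a key means read, concatenate, write back.
    db[b"key1"] = db[b"key1"] + b" second chunk"
    print(db[b"key1"])  # b'first chunk second chunk'
```

With very large values, each such append rereads and rewrites the whole record, which is one reason build time can balloon as values grow.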

Also consider this:
I had a somewhat similar problem. I ended up using MySQL as a backend.
In my application, the data was actually composed of a number of
fields, and I wanted to select based on some of those fields as well
(i.e. select based on part of the value, not just the keys), and thus
needed indices for those fields. The result was that my disk I/O was
saturated (i.e. the application was running as fast as the hard drive
would let it), so that was good enough for me.

Hope this helps,
-Nick Vatamaniuc

May 9 '07 #2

Thanks for the info, Nick. I plan on accessing the data in pretty much
random order, and once the database is built, it will be read-only.
At this point I'm not too concerned about access times, just getting
something to work. I've been messing around with both btree and hash
with limited success, which led me to think that maybe I was going
beyond some internal limit for the data size. It works great on a
limited set of data, but once I turn it loose on the full set, usually
several hours later, it either causes a hard reset of my machine or
the HD grinds on endlessly with no apparent progress. Is there a limit
to the size of data you can place per key?

Thanks for the MySQL suggestion, I'll take a look.

-JM

May 9 '07 #3

On May 9, 4:01 pm, sinoo...@yahoo.com wrote:
JM,

If you want, take a look at my PyDBTable on www.psipy.com.

The description and the examples section are still being finished, but
the source API documentation will help you.

It is a fast Python wrapper around MySQL, PostgreSQL or SQLite. It is
very fast and buffers queries and insertions. You just set up the
database and then pass the connection parameters to the initializer
method. After that you can use the pydb object as a dictionary of
{ primary_key : list_of_values }. You can even create indices on
individual fields and run queries like:

pydb.query( ['id','data_field1'], ('id','<',10),
('data_field1','LIKE','Hello%') )

which translates into an SQL query like:

SELECT id, data_field1 FROM ... WHERE id<10 AND data_field1 LIKE 'Hello%'

and returns an __iterator__.

The iterator as the result is excellent because you can iterate over
results much larger than your virtual memory, while in the background
PyDBTable retrieves rows from the database in large batches and caches
them to optimise I/O.

Anyway, on my machine PyDBTable saturates the disk I/O (it runs as
fast as a pure MySQL query).
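The same select-on-indexed-fields idea can be sketched with the stdlib sqlite3 module (the table and column names here are made up to mirror the example above); a sqlite3 cursor is itself an iterator, so large result sets can be consumed row by row without loading them all into memory:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real on-disk file in practice
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, data_field1 TEXT)")
conn.execute("CREATE INDEX idx_data ON items (data_field1)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "Hello world"), (2, "Hello again"), (11, "Hello there")])

# Rough equivalent of the pydb.query(...) call above:
cur = conn.execute(
    "SELECT id, data_field1 FROM items "
    "WHERE id < 10 AND data_field1 LIKE 'Hello%'")
for row in cur:  # the cursor streams rows; nothing is fetched up front
    print(row)   # (1, 'Hello world') then (2, 'Hello again')
conn.close()
```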

Take care,
-Nick Vatamaniuc

May 10 '07 #4

Thanks for the suggestion. I do remember reading that, but I don't
think it helped much. Experimenting with the different settings, I
found that the cache size is where the problem was. I've got it set to
1.5 GB and it's pretty happy at the moment, and the build time is a
fraction of what it used to be. Thanks again for all the suggestions.
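For comparison, other embedded databases expose the same knob; as an analogous, runnable sketch (this uses sqlite3's PRAGMA cache_size, not Berkeley DB, and the 1.5 GB figure just mirrors the setting above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A negative value sets the cache size in KiB: 1.5 GB = 1572864 KiB.
conn.execute("PRAGMA cache_size = -1572864")
size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(size)  # -1572864
conn.close()
```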

Regards,
JM


May 11 '07 #5
