Questions about bsddb

Hello,

I need to build a large database with roughly 500,000 keys and a
variable amount of data for each key. The data for each key could
range from 100 bytes to several megabytes, and the data under each key
will grow over time as the database is being built. Are there flags I
should set when opening the database to handle large amounts of data
per key? Is hash or B-tree (bt) recommended for this type of job? I'll
be building the database from scratch, so there will be lots of
lookups and appending of data. Testing shows bt to be faster, so I'm
leaning towards that. The estimated build time is around 10 to 12
hours on my machine, so I want to make sure that nothing gets messed
up in the 10th hour.
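
The lookup-and-append workload described above can be sketched roughly as follows. This is only an illustration: it uses the standard library's dbm.dumb as a stand-in for bsddb (an assumption, since bsddb is not bundled with current Python), and the file name and keys are made up.

```python
import dbm.dumb
import os
import tempfile


def append_value(store, key, chunk):
    """Append chunk to whatever is already stored under key."""
    # A key-value store has no in-place append: read the old value,
    # concatenate, and write the whole thing back.
    existing = store[key] if key in store else b""
    store[key] = existing + chunk


# Hypothetical demo database in a temporary directory.
workdir = tempfile.mkdtemp()
store = dbm.dumb.open(os.path.join(workdir, "build_demo"), "c")

append_value(store, b"key-1", b"first;")
append_value(store, b"key-1", b"second;")
append_value(store, b"key-2", b"only;")

result = store[b"key-1"]
store.close()
```

Note that every append rewrites the full value, so the cost of each append grows as a key's data approaches megabytes.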

TIA,
JM

May 9 '07 #1
JM,

How will you access your data?
If you often access the keys sequentially, then bt is
better.

In general, the rule is:

1) for small data sets, either one works
2) for larger data sets, use bt. Also, bt is good for sequential key
access.
3) for really huge data sets where the metadata of the btree
cannot even fit in the cache, hash will be better. The reasoning
is that once the metadata is larger than the cache, any lookup costs
at least one I/O operation, but with a btree it might take multiple
I/Os just to find the key, because the tree is not all in memory and
has multiple levels.
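
To put rough numbers on the "multiple levels" point in rule 3, here is a back-of-the-envelope sketch. The fan-out values (keys per page) are illustrative assumptions, not Berkeley DB's real figures, which depend on page size and key size.

```python
import math


def btree_levels(num_keys, keys_per_page):
    """Approximate number of levels in a B-tree holding num_keys keys."""
    if num_keys <= 1:
        return 1
    # Each level multiplies the number of reachable keys by the fan-out,
    # so the depth grows with the logarithm of the key count.
    return max(1, math.ceil(math.log(num_keys, keys_per_page)))


# Estimated depth for the 500,000 keys mentioned in the question,
# under a few assumed fan-outs.
for fanout in (10, 50, 200):
    print(fanout, btree_levels(500_000, fanout))
```

If none of those levels is cached, each one can cost a separate disk read, which is the case where a hash lookup may come out ahead.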

Also consider this:
I had a somewhat similar problem and ended up using MySQL as a
backend. In my application, the data was actually composed of a number
of fields, and I wanted to select based on some of those fields as well
(i.e. select based on part of the value, not just the keys), so I
needed indices on those fields. The result was that my disk I/O
was saturated (i.e. the application was running as fast as the hard
drive would let it), so it was good enough for me.
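
As a rough illustration of selecting on a field rather than on the key, here is a minimal sketch. It uses the standard library's sqlite3 instead of MySQL so that it is self-contained, and the table, column, and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per key; "category" stands in for a field inside the value.
conn.execute(
    "CREATE TABLE records (key TEXT PRIMARY KEY, category TEXT, payload BLOB)"
)
# Secondary index so lookups by category do not scan the whole table.
conn.execute("CREATE INDEX idx_records_category ON records (category)")

rows = [
    ("k1", "alpha", b"payload one"),
    ("k2", "beta", b"payload two"),
    ("k3", "alpha", b"payload three"),
]
conn.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
conn.commit()

# Select by a field of the value, not by the primary key.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT key FROM records WHERE category = ? ORDER BY key", ("alpha",)
    )
]
conn.close()
```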

Hope this helps,
-Nick Vatamaniuc

May 9 '07 #2
Thanks for the info, Nick. I plan on accessing the data in pretty much
random order, and once the database is built, it will be read-only.
At this point I'm not too concerned about access times, just getting
something to work. I've been messing around with both bt and hash with
limited success, which led me to think that maybe I was going beyond
some internal limit for the data size. It works great on a limited set
of data, but once I turn it loose on the full set, usually several
hours later, it either causes a hard reset of my machine or the HD
grinds on endlessly with no apparent progress. Is there a limit to
the size of data you can place per key?

Thanks for the MySQL suggestion, I'll take a look.

-JM

May 9 '07 #3
JM,

If you want, take a look at my PyDBTable on www.psipy.com.

The description and examples sections are still being finished, but
the source API documentation will help you.

It is a fast Python wrapper around MySQL, PostgreSQL or SQLite that
buffers queries and insertions. You just set up the
database and then pass the connection parameters to the initializer
method. After that you can use the pydb object as a dictionary of
{ primary_key : list_of_values }. You can even create indices on
individual fields and run queries like:
---------------------------------------------------------------------------------------------------------
pydb.query( ['id','data_field1'], ('id','<',10),
('data_field1','LIKE','Hello%') )
---------------------------------------------------------------------------------------------------------

which will translate into a SQL query like:

---------------------------------------------------------------------------------------------------------
SELECT id, data_field1 FROM ... WHERE id<10 AND data_field1 LIKE 'Hello%'
---------------------------------------------------------------------------------------------------------

and return an __iterator__.
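
For readers wondering how condition tuples like these could map onto SQL, here is a minimal sketch. It is not PyDBTable's actual implementation: build_select, ALLOWED_OPERATORS, and the table name demo_table are hypothetical names invented for this illustration.

```python
# Operators this sketch is willing to pass through to SQL.
ALLOWED_OPERATORS = {"<", "<=", "=", ">=", ">", "LIKE"}


def build_select(table, columns, *conditions):
    """Turn (column, operator, value) tuples into SQL plus parameters."""
    where_parts = []
    params = []
    for column, operator, value in conditions:
        if operator not in ALLOWED_OPERATORS:
            raise ValueError("unsupported operator: %r" % (operator,))
        # Values become placeholders; column and table names are
        # interpolated as-is, so they must come from trusted code.
        where_parts.append("%s %s ?" % (column, operator))
        params.append(value)
    sql = "SELECT %s FROM %s" % (", ".join(columns), table)
    if where_parts:
        sql += " WHERE " + " AND ".join(where_parts)
    return sql, params


sql, params = build_select(
    "demo_table",
    ["id", "data_field1"],
    ("id", "<", 10),
    ("data_field1", "LIKE", "Hello%"),
)
```

Passing the values as parameters rather than pasting them into the string leaves quoting to the database driver.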

Returning an iterator is excellent because you can iterate over
results much larger than your virtual memory. In the background,
PyDBTable retrieves rows from the database in large batches and
caches them so as to optimise I/O.
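
The batching idea can be sketched with any DB-API cursor. The generator below is an assumption about the general technique, not PyDBTable's code; iter_rows and the demo table are hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3


def iter_rows(cursor, batch_size=1000):
    """Yield rows one at a time while fetching them in batches."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        for row in batch:
            yield row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, data_field1 TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(i, "Hello %d" % i) for i in range(25)],
)

cursor = conn.execute(
    "SELECT id, data_field1 FROM demo WHERE id < ? ORDER BY id", (10,)
)
# Only batch_size rows are held in memory at any moment.
ids = [row[0] for row in iter_rows(cursor, batch_size=4)]
conn.close()
```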

Anyway, on my machine PyDBTable saturates the disk I/O (it runs as
fast as a pure MySQL query).

Take care,
-Nick Vatamaniuc

May 10 '07 #4
Thanks for the suggestion. I do remember reading that, but I don't
think it helped much. Experimenting with the different settings, I
found that the cache size was where the problem was. I've
got it set to 1.5 GB and it's pretty happy at the moment, and the
build time is a fraction of what it used to be. Thanks
again for all the suggestions.
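
The thread does not say which call was used to set the cache size. One way to do it, assuming the third-party bsddb3 package is installed, is DB.set_cachesize, which takes the size split into whole gigabytes plus remaining bytes. The helper names and the path argument below are hypothetical, and the import is guarded so the arithmetic still runs without bsddb3.

```python
try:
    from bsddb3 import db  # third-party package; may not be installed
except ImportError:
    db = None

GIGABYTE = 1024 ** 3


def split_cachesize(total_bytes):
    """Split a byte count into the (gbytes, bytes) pair set_cachesize expects."""
    return divmod(total_bytes, GIGABYTE)


def open_btree_with_cache(path, cache_bytes):
    """Open (or create) a B-tree database with the given cache size."""
    if db is None:
        raise RuntimeError("bsddb3 is not installed")
    gbytes, extra_bytes = split_cachesize(cache_bytes)
    handle = db.DB()
    # The cache size has to be set before the database is opened.
    handle.set_cachesize(gbytes, extra_bytes)
    handle.open(path, dbtype=db.DB_BTREE, flags=db.DB_CREATE)
    return handle


# 1.5 GB, the figure mentioned above: 1 GB plus 512 MB.
gbytes, extra_bytes = split_cachesize(1536 * 1024 * 1024)
```

Whether 1.5 GB is a sensible value depends on how much RAM the machine has. A cache larger than physical memory would likely bring back the disk thrashing described earlier in the thread.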

Regards,
JM


May 11 '07 #5
