
Large Database System

Hi there,

We have been looking for some time now for a database system that can
fit a large distributed computing project, but we haven't been able to
find one.
I was hoping that someone can point us in the right direction or give
us some advice.

Here is what we need. Mind you, these are ideal requirements so we do
not expect to find something that fits entirely into what we need
but we hope to get somewhat closer to that.

We need a database/file system:
1. built in C, preferably ANSI C, so that we can port it to Linux/
Unix, Windows, Mac and various other platforms; if it can work on
Linux only then it is OK for now
2. that has a public domain or GPL/LGPL licence and source code access
3. uses hashing or b-trees or a similar structure
4. has support for files in the range of 1-10 GB; if it can get to 1
GB only, that should still be OK
5. can work with an unlimited number of files on a local machine; we
don't need access over a network, just local file access
6. that is fairly simple (i.e. library-style, key/data records); it
doesn't have to have SQL support of any kind; as long as we can add,
update, possibly delete data, browse through the records and filter/
query them it should be OK; no other features are required, like
backup, restore, users & security, stored procedures... (a rough
sketch of the kind of interface we mean follows this list)
7. reliable if possible
8. local transactional support if possible; there is no need for
distributed transactions
9. fast data access if possible
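
To make point 6 concrete, here is a rough sketch of the sort of
minimal key/data interface we have in mind. The names (kvdb_open,
kvdb_put and so on) are invented for this post and do not belong to
any existing library:

/* Illustration only: a made-up minimal key/data interface. */
#include <stddef.h>

typedef struct kvdb kvdb;  /* opaque handle to one local data file */

kvdb *kvdb_open(const char *path);                         /* open or create */
int   kvdb_put(kvdb *db, const void *key, size_t klen,
               const void *val, size_t vlen);              /* add or update  */
int   kvdb_get(kvdb *db, const void *key, size_t klen,
               void **val, size_t *vlen);                  /* look up        */
int   kvdb_del(kvdb *db, const void *key, size_t klen);    /* delete         */
int   kvdb_next(kvdb *db, const void **key, size_t *klen); /* browse records */
void  kvdb_close(kvdb *db);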

We cannot use any of the major commercial databases (e.g. Oracle, SQL
Server, DB2 or larger systems like Daytona...), obviously because of
licensing and source code issues. We looked more closely at MySQL and
PostgreSQL, but they are too big and have way too many features that we
do not need. We need to be able to install a database/file system on
possibly tens of thousands of machines, and we also expect it to work
without administration.
On top of that, we might end up with thousands of files of different
sizes on each machine. Are there any embedded (i.e. "lighter")
versions of these two databases?
We haven't been able to find anything like that. I am not sure how
much work would be involved in "trimming" down some of these databases,
but that doesn't seem to be too easy to do.
Berkeley DB would have been the best fit, but it is now in Oracle's
hands and the licence has changed. TinyCDB was a close call, but the
fact that we need to rebuild the database for each data update makes it
unfeasible for large files (i.e. ~1 GB). SQLite is very interesting,
but it has many features that we don't need, like SQL support.
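
To show what that rebuild constraint looks like in practice, here is
roughly how a single update has to be done with TinyCDB as far as we
understand it (the cdb_make_* names are from cdb.h as we remember
them, so please check them against the TinyCDB headers):

/* A cdb file is immutable, so changing even one record means writing a
 * complete new file and renaming it into place. */
#include <cdb.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct cdb_make cdbm;
    int fd = open("data.cdb.tmp", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    cdb_make_start(&cdbm, fd);

    /* every existing record has to be streamed through again... */
    cdb_make_add(&cdbm, "key-1", 5, "old value", 9);
    /* ...just to add or change this one: */
    cdb_make_add(&cdbm, "key-2", 5, "new value", 9);

    cdb_make_finish(&cdbm);
    close(fd);

    /* atomically replace the old file; at ~1 GB this full rewrite per
     * update is what makes TinyCDB unfeasible for us */
    if (rename("data.cdb.tmp", "data.cdb") != 0) { perror("rename"); return 1; }
    return 0;
}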

Right now we are using plain XML files so anything else would be a
great improvement.

Any suggestions or links to sites or papers or books would be welcome.
Any help would be greatly appreciated.

If this is not in the proper forum I appreciate if someone can move
the post to the right location or point us to the right one.

Thanks in advance.

Best regards,
Ovidiu Anghelidi
ov****@intelligencerealm.com

Artificial Intelligence - Reverse Engineering The Brain

Oct 19 '07 #1
On Oct 19, 7:54 am, raidv...@yahoo.com wrote:
Hi there,

We have been looking for some time now for a database system that can
fit a large distributed computing project, but we haven't been able to
find one.
I was hoping that someone can point us in the right direction or give
us some advice.

Here is what we need. Mind you, these are ideal requirements so we do
not expect to find something that fits entirely into what we need
but we hope to get somewhat closer to that.

We need a database/file system:
1. built in C, preferably ANSI C, so that we can port it to Linux/
Unix, Windows, Mac and various other platforms; if it can work on
Linux only then it is OK for now
2. that has a public domain or GPL/LGPL licence and source code access
3. uses hashing or b-trees or a similar structure
4. has support for files in the range of 1-10 GB; if it can get to 1
GB only, that should still be OK
5. can work with an unlimited number of files on a local machine; we
don't need access over a network, just local file access
6. that is fairly simple (i.e. library-style, key/data records); it
doesn't have to have SQL support of any kind; as long as we can add,
update, possibly delete data, browse through the records and filter/
query them it should be OK; no other features are required, like
backup, restore, users & security, stored procedures...
7. reliable if possible
8. local transactional support if possible; there is no need for
distributed transactions
9. fast data access if possible

We cannot use any of the major commercial databases (e.g. Oracle, SQL
Server, DB2 or larger systems like Daytona...), obviously because of
licensing and source code issues. We looked more closely at MySQL and
PostgreSQL, but they are too big and have way too many features that we
do not need. We need to be able to install a database/file system on
possibly tens of thousands of machines, and we also expect it to work
without administration.
Say that last sentence out loud in front of a group of DBAs and I
guess you will get a little bit of mirth. This statement alone is
proof that your project will fail. Every database system (even simple
keysets like the Sleepycat database) needs administration.

Listen, you are going to have tens of thousands of points of failure
in your system. Is that what you really want? If you have (for
instance) 20,000 machines getting a big pile of data shoved down their
throat, you pretty much have a guarantee that a few hundred are going
to be out of space and that once a month a disk drive is going to fail
somewhere.
On top of that, we might end up with thousands of files of different
sizes on each machine. Are there any embedded (i.e. "lighter")
versions of these two databases?
Do you know what happens to performance when you put thousands of
active files on a machine? Pretend that you are a disk head and
imagine the jostling you are going to receive.
We haven't been able to find anything like that. I am not sure how
much work would be involved in "trimming" down some of these databases,
but that doesn't seem to be too easy to do.
They are the size that they are for a reason. It's not fat that gets
trimmed off to scale things down, it's muscle.
Berkeley DB would have been the best fit, but it is now in Oracle's
hands and the licence has changed. TinyCDB was a close call, but the
fact that we need to rebuild the database for each data update makes it
unfeasible for large files (i.e. ~1 GB). SQLite is very interesting,
but it has many features that we don't need, like SQL support.
You do know that SQLite is a single user database?
Right now we are using plain XML files so anything else would be a
great improvement.
I'll say.
Any suggestions or links to sites or papers or books would be welcome.
Any help would be greatly appreciated.

If this is not in the proper forum I appreciate if someone can move
the post to the right location or point us to the right one.
The right thing to do is to go to SourceForge and run a few
searches. The pedagogic answer is to refer you to the newsgroup
news:comp.sources.wanted, but it's a ghost town.

I suspect that you have no idea what you are doing. Do you have any
concept about what is going to happen when your problem scales to
10GB? Get a consultant who understands the problem space or you'll be
sorry. By the way, this is definitely not the right forum for your
post -- which does not exactly make it appear that you have anything
on the ball. (Really, a newsgroup post in general is the wrong
approach here.)

I guess that FastDB or GigaBase might be suitable (WARNING! One
writer at a time). I also guess that you are going to severely need
the capabilities that you do not think you need at some point.
http://www.garret.ru/~knizhnik/databases.html

Another possibility is QDBM:
http://sourceforge.net/projects/qdbm/
I guess that you will like this one but also that it is the wrong
choice.
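
If you do look at QDBM, its Depot layer is about as simple as the
key/data interface you describe. Roughly (this is from memory, so
check the QDBM documentation before trusting the details):

/* Minimal QDBM Depot usage sketch: one hash-database file, one record. */
#include <depot.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DEPOT *depot;
    char *val;

    /* open the database file, creating it if it does not exist */
    if (!(depot = dpopen("results.qdbm", DP_OWRITER | DP_OCREAT, -1))) {
        fprintf(stderr, "dpopen: %s\n", dperrmsg(dpecode));
        return 1;
    }

    /* add or overwrite one key/data record (-1 means "use strlen") */
    if (!dpput(depot, "unit:42", -1, "result-blob-goes-here", -1, DP_DOVER))
        fprintf(stderr, "dpput: %s\n", dperrmsg(dpecode));

    /* fetch it back; dpget returns a malloc'ed, NUL-terminated region */
    if ((val = dpget(depot, "unit:42", -1, 0, -1, NULL))) {
        printf("value: %s\n", val);
        free(val);
    }

    dpclose(depot);
    return 0;
}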

I don't know anything about your project but I think you need to
rethink your big picture of how you are going to solve it.

Oct 19 '07 #2
Hi there,

Thank you for taking the time to answer to the post.
Say that last sentence out loud in front of a group of DBAs and I
guess you will get a little bit of mirth. This statement alone is
proof that your project will fail.
We are using the BOINC distributed architecture for now, which seems
to already be working on hundreds of thousands of machines. We want to
add database capabilities to the data files that are being processed.
Every database system (even simple
keysets like the Sleepycat database) needs administration.
Because of the sheer number of machines involved in computations we
need to avoid that. If it will not be possible we'll have to stick
with XML until we find something better.
Listen, you are going to have tens of thousands of points of failure
in your system. Is that what you really want? If you have (for
instance) 20,000 machines getting a big pile of data shoved down their
throat, you pretty much have a guarantee that a few hundred are going
to be out of space and that once a month a disk drive is going to fail
somewhere
That is not a concern. When the same data is replicated across 1 to
10 machines, reliability is no longer an issue.
Do you know what happens to performance when you put thousands of
active files on a machine? Pretend that you are a disk head and
imagine the jostling you are going to receive.
I haven't been able to provide more details about the project, but most
of the data will be historic in nature. Once a calculation is
performed, that data will be stored and in most cases will no longer be
active. It will still be needed, though. So having thousands of files
on a machine is not so bad. This is not a classic database
application, and that is why it probably seems strange that features
like reliability, which should be at the top, are listed last and are
not a concern.
They are the size that they are for a reason. It's not fat that gets
trimmed off to scale things down, it's muscle.
You do know that SQLite is a single user database?
That is exactly what we need. Data will be sent over the Internet to
other machines which will also use a single user database.
The right thing to do is to go to SourceForge and run a few
searches. The pedagogic answer is to refer you to the newsgroup
news:comp.sources.wanted, but it's a ghost town.
I have looked over there but I should probably search again.
Thank you.
I suspect that you have no idea what you are doing. Do you have any
concept about what is going to happen when your problem scales to
10GB?
If things go bad at 10 GB we can just go with 1 GB or if 1 GB is not
good we can go with 100 MB. We can always increase the number of files
and distribute the data on more machines. The ideal solution is to
have the data in large compact files.
Get a consultant who understands the problem space or you'll be
sorry. By the way, this is definitely not the right forum for your
post -- which does not exactly make it appear that you have anything
on the ball. (Really a newsgroup post in general is the wrong
approach here).
I guess that FastDB or GigaBase might be suitable (WARNING! One
writer at a time). I also guess that you are going to severely need
the capabilities that you do not think you need at some point.
http://www.garret.ru/~knizhnik/databases.html
Finding out about these two databases is a step forward, and it seems
that it was worthwhile to post here. Again, I appreciate your time.
We will look closer at these.
Another possibility is QDBM: http://sourceforge.net/projects/qdbm/
I guess that you will like this one but also that it is the wrong
choice.
I have looked at it before. It appears to be quite new and there are
not many people using it, and we do not want to go down a narrow road
that is less traveled.
I don't know anything about your project but I think you need to
rethink your big picture of how you are going to solve it.
Thanks again.
Ovidiu

Oct 21 '07 #3
ra******@yahoo.com wrote:
>
We have been looking for some time now for a database system that
can fit a large distributed computing project, but we haven't been
able to find one. I was hoping that someone can point us in the
right direction or give us some advice.

Here is what we need. Mind you, these are ideal requirements so we
do not expect to find something that fits entirely into what we
need but we hope to get somewhat closer to that.

We need a database/file system:
1. built in C, preferably ANSI C, so that we can port it to Linux/
Unix, Windows, Mac and various other platforms; if it can work
on Linux only then it is OK for now
2. that has a public domain or GPL/LGPL licence and source code
access
3. uses hashing or b-trees or a similar structure
4. has support for files in the range of 1-10 GB; if it can get to
1 GB only, that should still be OK
5. can work with an unlimited number of files on a local machine;
we don't need access over a network, just local file access
6. that is fairly simple (i.e. library-style, key/data records);
it doesn't have to have SQL support of any kind; as long as we
can add, update, possibly delete data, browse through the
records and filter/query them it should be OK; no other
features are required, like backup, restore, users & security,
stored procedures...
7. reliable if possible
8. local transactional support if possible; there is no need for
distributed transactions
9. fast data access if possible
We already have such a thing in hashlib, with the exception of the
ability to easily store and recall from external files. Such a
facility can be added, but requires that the file mechanism knows
all about the structure of the database. So far hashlib is
completely independent of such structure. Available under GPL,
written in purely standard C:

<http://cbfalconer.home.att.net/download/>

--
Chuck F (cbfalconer at maineline dot net)
Available for consulting/temporary embedded and systems.
<http://cbfalconer.home.att.net>

--
Posted via a free Usenet account from http://www.teranews.com

Oct 22 '07 #4
On Oct 20, 10:53 pm, raidv...@yahoo.com wrote:
Hi there,

Thank you for taking the time to answer to the post.
Say that last sentence out loud in front of a group of DBAs and I
guess you will get a little bit of mirth. This statement alone is
proof that your project will fail.

We are using the BOINC distributed architecture for now, which seems
to already be working on hundreds of thousands of machines. We want to
add database capabilities to the data files that are being processed.
Every database system (even simple
keysets like the Sleepycat database) needs administration.

Because of the sheer number of machines involved in computations we
need to avoid that. If it will not be possible we'll have to stick
with XML until we find something better.
I do not understand why you don't want to store your thousands of
files on a single database server and then let the machines check out
problems from the database server. It seems a much less complicated
solution to me. The administration is now confined to a single
machine.

I imagine it like this:
The data is loaded into a single, whopping database on a very hefty
machine (Ultra320 SCSI disks, 20 GB RAM, 4 cores or more). There is a
"problem" table that stores the data set information and also the
status of the problem (e.g. 'verified', 'solved', 'checked out',
'unsolved'). The users connect to the database and check out unsolved
problems until all are solved and then check out solved problems until
all of them are verified.
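
Just to make the checkout idea concrete, here is a rough sketch. The
table and column names are invented, and I am using SQLite's C API
only so that the example is self-contained; against the central server
you would use the MySQL or PostgreSQL client library instead:

/* One row per work unit; workers check a unit out by flipping its status. */
#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    char *err = NULL;

    if (sqlite3_open("problems.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* the "problem" table: the data set plus its current status */
    const char *setup =
        "CREATE TABLE IF NOT EXISTS problem ("
        "  id      INTEGER PRIMARY KEY,"
        "  dataset BLOB,"
        "  status  TEXT NOT NULL DEFAULT 'unsolved');";
        /* other statuses: 'checked out', 'solved', 'verified' */

    /* check out one unsolved problem */
    const char *checkout =
        "UPDATE problem SET status = 'checked out' "
        "WHERE id = (SELECT id FROM problem WHERE status = 'unsolved' LIMIT 1);";

    if (sqlite3_exec(db, setup, NULL, NULL, &err) != SQLITE_OK ||
        sqlite3_exec(db, checkout, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "SQL error: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}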
Listen, you are going to have tens of thousands of points of failure
in your system. Is that what you really want? If you have (for
instance) 20,000 machines getting a big pile of data shoved down their
throat, you pretty much have a guarantee that a few hundred are going
to be out of space and that once a month a disk drive is going to fail
somewhere

That is not a concern. When the same data is replicated across 1 to
10 machines, reliability is no longer an issue.
What if you get 4 different answers? What if the data is damaged on
three of them? Reliability is always an issue. The more complicated
the system, the more difficult it will become to verify validity of
your answers.
Do you know what happens to performance when you put thousands of
active files on a machine? Pretend that you are a disk head and
imagine the jostling you are going to receive.

I haven't been able to provide more details about the project, but most
of the data will be historic in nature. Once a calculation is
performed, that data will be stored and in most cases will no longer be
active. It will still be needed, though. So having thousands of files
on a machine is not so bad. This is not a classic database
application, and that is why it probably seems strange that features
like reliability, which should be at the top, are listed last and are
not a concern.
Data reliability is always a concern. If you cannot verify the
reliability of the data, then nobody should trust your answers.
They are the size that they are for a reason. It's not fat that gets
trimmed off to scale things down, it's muscle.
You do know that SQLite is a single user database?

That is exactly what we need. Data will be sent over the Internet to
other machines which will also use a single user database.
How will you coordinate who is working on what steps of the problem?
The right thing to do is to go to SourceForge and run a few
searches. The pedagogic answer is to refer you to the newsgroup
news:comp.sources.wanted, but it's a ghost town.

I have looked over there but I should probably search again.
Thank you.
I suspect that you have no idea what you are doing. Do you have any
concept about what is going to happen when your problem scales to
10GB?

If things go bad at 10 GB we can just go with 1 GB or if 1 GB is not
good we can go with 100 MB. We can always increase the number of files
and distribute the data on more machines. The ideal solution is to
have the data in large compact files.
Get a consultant who understands the problem space or you'll be
sorry. By the way, this is definitely not the right forum for your
post -- which does not exactly make it appear that you have anything
on the ball. (Really a newsgroup post in general is the wrong
approach here).
I guess that FastDB or GigaBase might be suitable (WARNING! One
writer at a time). I also guess that you are going to severely need
the capabilities that you do not think you need at some point.
http://www.garret.ru/~knizhnik/databases.html

Finding out about these two databases is a step forward, and it seems
that it was worthwhile to post here. Again, I appreciate your time.
We will look closer at these.
Another possibility is QDBM: http://sourceforge.net/projects/qdbm/
I guess that you will like this one but also that it is the wrong
choice.

I have looked at it before. It appears to be quite new and there are
not many people using it, and we do not want to go down a narrow road
that is less traveled.
I don't know anything about your project but I think you need to
rethink your big picture of how you are going to solve it.
Since single user data access is what you are after, FastDB might be
interesting. If you compile it for 64 bit UNIX you can have files of
arbitrary size, and they are memory mapped so access should be very
fast. I have done experiments with FastDB and its performance is
quite good. You can use it as a simple file source but it also has
advanced capabilities. The footprint is very small.

I think we should move the discussion to news:comp.programming, and so
I have set the follow-ups.
Oct 22 '07 #5
