
huge dictionary

Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than the RAM memory
(I expect more than ten gigabytes). So I need to save it
(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).

One solution would be to have a very large swap partition,
under linux. But this is not a good idea in my case,
since I need to keep the dictionary for future analysis.
Except if I save it in the end to another partition ...
why not, actually ?

Another solution would be to use an external database
like SQLite or Metakit.

Which solution would you recommend ?
Speed of execution is very important.

Thanks in advance for any hint !

Yours, Cristian Barbarosie
http://cmaf.fc.ul.pt/~barbaros

Oct 9 '05 #1
barbaros wrote:
) Hello everybody,
)
) I need some advice on the following problem.
)
) ...
)
) Another solution would be to use an external database
) like SQLite or Metakit.

Personally, I'd say that this is the recommended solution.
Someone else already solved all the difficult problems for you,
so why not use it ? Should be quite easy to implement, as well.
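
For what it's worth, a rough sketch of how the SQLite route could look
(just an illustration: the file and table names are made up and error
checking is left out). "INSERT OR REPLACE" matches the append-or-update
pattern you describe.

#include <sqlite3.h>
#include <stdio.h>

int main(void)
{
    sqlite3 *db;
    sqlite3_stmt *stmt;

    sqlite3_open("dict.db", &db);
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS dict(key TEXT PRIMARY KEY, value REAL)",
        0, 0, 0);

    /* insert a new entry, or overwrite the value of an existing one */
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO dict(key, value) VALUES(?, ?)", -1, &stmt, 0);
    sqlite3_bind_text(stmt, 1, "some key", -1, SQLITE_TRANSIENT);
    sqlite3_bind_double(stmt, 2, 3.14);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    /* look an entry up */
    sqlite3_prepare_v2(db,
        "SELECT value FROM dict WHERE key = ?", -1, &stmt, 0);
    sqlite3_bind_text(stmt, 1, "some key", -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        printf("%f\n", sqlite3_column_double(stmt, 0));
    sqlite3_finalize(stmt);

    sqlite3_close(db);
    return 0;
}
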
SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Oct 9 '05 #2
> Personally, I'd say that this is the recommended solution.

OK, but would you recommend either of the two (SQLite or Metakit)?
Thanks. Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros

Oct 9 '05 #3
If their license matches your needs, I would recommend
Berkley bsd from sleepycat.

http://www.sleepycat.com
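
Roughly like this, for instance (an untested sketch against the
Berkeley DB 4.x C API; the file name is arbitrary and error checking
is omitted):

#include <db.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    DB *dbp;
    DBT key, data;
    double value = 3.14, out;

    db_create(&dbp, NULL, 0);
    dbp->open(dbp, NULL, "dict.db", NULL, DB_HASH, DB_CREATE, 0644);

    memset(&key, 0, sizeof key);
    memset(&data, 0, sizeof data);
    key.data = (void *) "some key";
    key.size = sizeof "some key";
    data.data = &value;
    data.size = sizeof value;

    dbp->put(dbp, NULL, &key, &data, 0);        /* insert or overwrite */

    memset(&data, 0, sizeof data);
    if (dbp->get(dbp, NULL, &key, &data, 0) == 0) {
        memcpy(&out, data.data, sizeof out);    /* copy out the double */
        printf("%f\n", out);
    }

    dbp->close(dbp, 0);
    return 0;
}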

Mike
barbaros wrote:
Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than the RAM memory
(I expect more than ten gigabytes). So I need to save it
(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).

One solution would be to have a very large swap partition,
under linux. But this is not a good idea in my case,
since I need to keep the dictionary for future analysis.
Except if I save it in the end to another partition ...
why not, actually ?

Another solution would be to use an external database
like SQLite or Metakit.

Which solution would you recommend ?
Speed of execution is very important.

Thanks in advance for any hint !

Yours, Cristian Barbarosie
http://cmaf.fc.ul.pt/~barbaros

--
The greatest performance improvement occurs on the transition from
the non-working state to the working state.
Oct 9 '05 #4
moi
barbaros wrote:
Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than the RAM memory
(I expect more than ten gigabytes). So I need to save it
How many records?
[ Most of the records' space will be consumed by
the (variable length?) strings, which could be
compressed by using tries or skip lists. ]

(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).

One solution would be to have a very large swap partition,
under linux. But this is not a good idea in my case,
You could mmap, but the data still has to fit within your
address space, which will have to be bigger than 32 bits
if you want 10 GB of data.
[There will also be a problem with appends/inserts, but that
can probably be handled.]
since I need to keep the dictionary for future analysis.
Except if I save it in the end to another partition ...
why not, actually ?
Syntax error.
Another solution would be to use an external database
like SQLite or Metakit.

Which solution would you recommend ?
Speed of execution is very important.
Which speed: throughput or response time?


Thanks in advance for any hint !

Yours, Cristian Barbarosie
http://cmaf.fc.ul.pt/~barbaros


One way or the other, your data will have to be stored on disk.

Even in the ideal case where you store the hashtable or index
in core and the data on disk, each retrieve operation will
cost you one disk I/O.
You can only save on I/Os by clustering your data: putting data that
is accessed together on the same disk block, or at least close to it.
How much that helps depends on your access pattern (which you probably
don't know in advance).
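
To make the "index in core, data on disk" case concrete, here is a rough
sketch (my own illustration only): keep a 64-bit hash of each key in core,
mapped to the offset of the full record in a flat file, and pay one pread()
per lookup. Collision handling is glossed over, and the record is assumed
to fit in the read buffer.

#include <map>
#include <string>
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

static uint64_t hash_key(const std::string &s)   /* FNV-1a */
{
    uint64_t h = 14695981039346656037ULL;
    for (size_t i = 0; i < s.size(); ++i) {
        h ^= (unsigned char) s[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main()
{
    std::map<uint64_t, off_t> idx;               /* stays in core */
    int fd = open("records.dat", O_RDWR | O_CREAT, 0644);

    /* append one record: [key length][key bytes][double value] */
    std::string key = "some key";
    double value = 3.14;
    off_t pos = lseek(fd, 0, SEEK_END);
    uint32_t len = key.size();
    write(fd, &len, sizeof len);
    write(fd, key.data(), len);
    write(fd, &value, sizeof value);
    idx[hash_key(key)] = pos;

    /* fetch it back: one pread for the whole record */
    std::map<uint64_t, off_t>::iterator it = idx.find(hash_key(key));
    if (it != idx.end()) {
        char buf[4096];
        pread(fd, buf, sizeof buf, it->second);
        uint32_t l;
        double v;
        std::memcpy(&l, buf, sizeof l);
        std::memcpy(&v, buf + sizeof l + l, sizeof v);
        std::printf("%s -> %f\n", key.c_str(), v);
    }
    close(fd);
    return 0;
}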

BTW for a similar problem:
http://www-db.stanford.edu/~backrub/google.html
They manage to keep the 'dictionary' in core.
(but they probably can afford to forget/ignore some 'strings' )

HTH,
AvK
Oct 9 '05 #5
In article <43**************@fuse.net>,
Michael Schneider <mi**************@fuse.net> wrote:
If their license matches your needs, I would recommend
Berkley bsd from sleepycat.


You mean Berkeley DB.

-- Richard
Oct 9 '05 #6
Ian
barbaros wrote:
Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than the RAM memory
(I expect more than ten gigabytes). So I need to save it
(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).

You don't provide enough detail.

Where do you require the performance, on create, read or modify? Are
you constrained by performance or budget?

If access performance is key and you have the budget, use enough RAM to
hold the data and serialise it to disk on modify.

Ian
Oct 9 '05 #7
Richard Tobin wrote:
In article <43**************@fuse.net>,
Michael Schneider <mi**************@fuse.net> wrote:

If their license matches your needs, I would recommend
Berkley bsd from sleepycat.

You mean Berkeley DB.

-- Richard


Yes, fat-fingered the DB,

Thanks for the catch,
Mike

--
The greatest performance improvement occurs on the transition from
the non-working state to the working state.
Oct 10 '05 #8
>>I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.


You can consider writing (or using existing code if there is some
freely available) a B+ tree. This is typically how database writers
implement the indexing mechanism, and it lets you choose where to
create the file.
Again, the B+ tree is just an example. Based on your needs you can
adapt it or choose a different data structure.
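
For orientation, the node layout of a B+ tree might look roughly like this
(a sketch only; real implementations pack each node to the disk block size
and store child offsets rather than pointers):

enum { ORDER = 64 };                 // max children per internal node

struct BPlusLeaf {
    int         nkeys;
    const char *keys[ORDER - 1];     // the strings (or offsets to them)
    double      values[ORDER - 1];   // payload lives only in the leaves
    BPlusLeaf  *next;                // leaves are chained for range scans
};

struct BPlusInternal {
    int         nkeys;
    const char *keys[ORDER - 1];     // separator keys
    void       *children[ORDER];     // BPlusInternal* or BPlusLeaf*
};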

Using an existing database is a great idea! (If getting a license is
not an issue.)

Oct 10 '05 #9

"barbaros" <ba******@ptmat.fc.ul.pt> wrote in message
news:11**********************@z14g2000cwz.googlegroups.com...
Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than the RAM memory
(I expect more than ten gigabytes). So I need to save it
(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).

One solution would be to have a very large swap partition,
under linux. But this is not a good idea in my case,
since I need to keep the dictionary for future analysis.
Well, no need for swap: just mmap an ordinary file and that's it.
Just remember not to store real pointers but offsets from
what mmap returns. On a 64-bit system you can easily
mmap 10 GB, but on a 32-bit one not so; there you have to mmap
a piece at a time.

Another solution would be to use an external database
like SQLite or Metakit.
That would be easy.

Which solution would you recommend ?
Speed of execution is very important.


Go for mmap if you want speed, but it comes at the cost of
some additional work :)
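
A bare-bones illustration of the offset idea (POSIX assumed; the file
name and size are made up, and there is no error checking):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    const size_t size = 1 << 20;        /* 1 MB, just for the demo */
    int fd = open("dict.bin", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, size);

    char *base = (char *) mmap(0, size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, fd, 0);

    /* store a record at offset 64; only the offset is remembered,
       never the raw pointer */
    size_t rec = 64;
    double value = 3.14;
    memcpy(base + rec, &value, sizeof value);

    /* later (even in another run, after a fresh mmap) the same
       offset finds the same value again */
    double v;
    memcpy(&v, base + rec, sizeof v);
    printf("%f\n", v);

    munmap(base, size);
    close(fd);
    return 0;
}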

Greetings, Bane.
Oct 10 '05 #10
Some say I don't provide enough detail.
Let me see what I can add ...
I need to build a large dictionary, in the sense
of the "map" class in C++ STL. I need to build it
quickly (if not it may take years to get something
meaningful), so I need relatively quick write access.
Because it's so large I need to save (parts of) it on
the hard disk, and in the end to save it all on disk.
As for budget restrictions, I am committed to using
only free software (open source, to be more precise).

mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.
How do I know whether linux on a recent intel box is
32 or 64 bits ?

Actually, now I can imagine a way to compress a little
bit my dictionary. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.
Then the values of this map would be other maps,
again having letters as keys, and so on. That is,
I could work with a map of maps of maps ...
I guess this is more or less what moi suggested.
(By the way, moi, I see no syntax error in my message.)

Thank you everybody for your answers.
Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros

Oct 10 '05 #11

barbaros wrote:
Some say I don't provide enough detail.
Let me see what I can add ...
I need to build a large dictionary, in the sense
of the "map" class in C++ STL. I need to build it
quickly (if not it may take years to get something
meaningful), so I need relatively quick write access.
Because it's so large I need to save (parts of) it on
the hard disk, and in the end to save it all on disk.
As for budget restrictions, I am committed to using
only free software (open source, to be more precise).
Use boost::multi_index for that; hashed indexes might be what you need.
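
Something along these lines, for example (a sketch only; this covers the
in-memory container, not the on-disk part):

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>
#include <string>

struct Entry {
    std::string key;
    double value;
};

namespace bmi = boost::multi_index;

// A container with a single hashed index over Entry::key.
typedef bmi::multi_index_container<
    Entry,
    bmi::indexed_by<
        bmi::hashed_unique<
            bmi::member<Entry, std::string, &Entry::key>
        >
    >
> Dictionary;

// Usage: Dictionary d;  d.insert(Entry());  d.find(std::string("..."));
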
mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.
How do I know whether linux on a recent intel box is
32 or 64 bits ?
Check the output of the uname command.
In C/C++ code you can check sizeof(void*).

Actually, now I can imagine a way to compress a little
bit my dictionary. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.
Then the values of this map would be other maps,
again having letters as keys, and so on. That is,
I could work with a map of maps of maps ...
I guess this is more or less what moi suggested.
(By the way, moi, I see no syntax error in my message.)


This is called a digital search tree (DST). It consumes a lot of memory
for nodes. A ternary search tree has similar performance characteristics
to a DST but consumes less memory.
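
For illustration, a ternary search tree node could look roughly like this
(in-memory form; a disk-based variant would store child offsets instead of
pointers, along the lines discussed elsewhere in this thread):

struct TstNode {
    char     split;     // character this node discriminates on
    TstNode *lo;        // subtree for characters < split
    TstNode *eq;        // subtree for the next character of the key
    TstNode *hi;        // subtree for characters > split
    bool     is_key;    // true if a complete key ends at this node
    double   value;     // payload, meaningful only when is_key is true
};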

Oct 10 '05 #12
moi wrote:
BTW for a similar problem:
http://www-db.stanford.edu/~backrub/google.html
They manage to keep the 'dictionary' in core.
(but they probably can afford to forget/ignore some 'strings' )


Yes, you are perfectly right to point out this similar problem.
Thank you for the link.

Oct 10 '05 #13

"barbaros" <ba******@ptmat.fc.ul.pt> wrote in message
news:11*********************@g49g2000cwa.googlegroups.com...
Some say I don't provide enough detail.
Let me see what I can add ...

mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.
How do I know whether linux on a recent intel box is
32 or 64 bits ?
Well, you don't need to. Just organize pages.
Let's say you have a page table in memory, with a pointer to the mmapped
block and a flag for in-memory/out-of-memory status. The index into the
table represents the page number. Then you only need to store a page
number and an offset into the page, in say two integers, which together
represent a pointer in your map object. Let's say page = 0x1 and
offset = 0x4000; then table[1].mmaped_pointer + 0x4000 gives the real
address. You have to check whether table[1].in_memory == true; if not,
mmap that page and update
table[page_num].mmaped_pointer = mmap(...,page_num*page_size);
then set table[1].in_memory = true. You can also keep an access counter
table[1].acc_count and unmap pages that are not used, so that you keep
at most some maximum number of pages in memory.
This is normally done by the OS, but if you don't have enough address
space you have to do it yourself :)
When retrieving data you just have to know where the first node of
your data structure is. Let's say it's on page 0, offset 0x0.
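
In code, that page table might look roughly like this (hypothetical names,
no error checking, and the unmap-least-used eviction is left out):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstddef>

const std::size_t kPageSize = 4u * 1024 * 1024;   // 4 MB chunks, arbitrary
const std::size_t kNumPages = 4096;

struct PageEntry {
    char    *mmaped_pointer;   // base address, valid only when in_memory
    bool     in_memory;
    unsigned acc_count;        // could drive an unmap-least-used policy
};

static PageEntry table[kNumPages];
static int data_fd;            // the big, preallocated data file

// Translate a (page, offset) pair into a real address, mapping the
// page in on demand.
char *resolve(unsigned page, std::size_t offset)
{
    if (!table[page].in_memory) {
        table[page].mmaped_pointer = (char *) mmap(0, kPageSize,
            PROT_READ | PROT_WRITE, MAP_SHARED,
            data_fd, (off_t) page * kPageSize);
        table[page].in_memory = true;
    }
    table[page].acc_count++;
    return table[page].mmaped_pointer + offset;
}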

Actually, now I can imagine a way to compress a little
bit my dictionary. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.


You can use some hashing algorithm, and just store
hash values instead of strings.

Greetings, Bane.


Oct 10 '05 #14
moi
barbaros wrote:
Some say I don't provide enough detail.
Let me see what I can add ...
I need to build a large dictionary, in the sense
of the "map" class in C++ STL. I need to build it
quickly (if not it may take years to get something
mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.
Mmap() *can* be elegant, but gets ugly in the
presence of writes/appends to the same file (as in your
case).
Also, there is *no performance benefit* over write().
Mmap just maps a piece of your diskfile into memory,
but underneath just as much disk I/O is needed.
In the ultimate case (unclustered read access) you end up with one read
per record-fetch.
Writing/appending is always faster (since multiple records can fit into
one disk block).
Using a DB library does not change this; the library still has to
do the reading/writing on your behalf, and you end up with 10 ms of read
time per block. It *can* save you some development effort.

How do I know whether linux on a recent intel box is
32 or 64 bits ?
What another poster suggested:
printf("%u", (unsigned) sizeof (void*));

If the machine is 64-bit you can install/compile a 64-bit
Linux distribution onto it.
Linux has had 64-bit support since about 1994 (Alpha/PPC).

Actually, now I can imagine a way to compress a little
bit my dictionary. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.
Then the values of this map would be other maps,
again having letters as keys, and so on. That is,
I could work with a map of maps of maps ...
This is basically a tree-like structure.
Google for tree, trie, patricia or skiplist for more.
As another poster suggested: tree-like structures can take
a lot of (memory) space when badly designed/applied (e.g. when most of
the pointers are NULL).
I guess this is more or less what moi suggested.
(By the way, moi, I see no syntax error in my message.)
It was more about semantics. The paragraph made no sense to me.
Thank you everybody for your answers.
Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros

Good luck,
AvK
Oct 10 '05 #15

"moi" <avk@localhost> wrote in message
news:lr********************@casema.nl...
barbaros wrote:
Some say I don't provide enough detail.
Let me see what I can add ...
I need to build a large dictionary, in the sense
of the "map" class in C++ STL. I need to build it
quickly (if not it may take years to get something

mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address space. I understand
on a 64 bit machine there would be no problem.


Mmap() *can* be elegant, but gets ugly in the
presence of writes/appends to the same file (as in your
case).


Not necessarily: ftruncate, then mmap the additional page(s)
(or lseek and then write zeros, to avoid fragmentation of the file).
But it is better to just preallocate a large enough file
(write zero bytes, not ftruncate, because of fragmentation)
and expand it if that turns out not to be enough.
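
For example, something like this (sketch only; old_size must be a multiple
of the page size, since mmap offsets have to be page aligned, and error
checking is omitted):

#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Grow the file, then map only the new tail as a separate region.
char *map_more(int fd, off_t old_size, size_t extra)
{
    ftruncate(fd, old_size + (off_t) extra);   // extend the file
    return (char *) mmap(0, extra, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, old_size);
}
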
Also, there is *no performance benefit* over write().
Disk write is disk write, but with mmap you don't
have to manually read/write from file to app memory.
Mmap just maps a piece of your diskfile into memory,
but underneath just as much disk I/O is needed.
See the difference. No need for memory-to-file buffering
and data conversion in application code; therefore mmap
is the most natural way to implement a persistent data structure.
In the ultimate case (unclustered read access) you end up with one read
per record-fetch.
Same as with read(). Of course caching strategies can be different,
so either mmap or read can be faster depending on the situation.
In my experience mmap is better when dealing with large random-access
files (a large swap file, say).
For pure sequential reading, read() should be better.
Writing/appending is always faster (since multiple records can fit into
one disk block).
Using a DB library does not change this; the library still has to
do the reading/writing on your behalf, and you end up with 10 ms of read
time per block. It *can* save you some development effort.


Of course, except that everything goes through the db interface and the
application itself is limited by it. In this case this is not a problem,
since the db eliminates the need for a hash table implementation; one
just uses the db interface instead.

Greetings, Bane.
Oct 10 '05 #16
