fastest searchable datastructure?

Pieter

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #1

Subscribe Post Reply

1549

Marc Gravell

I need some type of array/list/... In which I can store objects together

with a unique key.

Sounds like Dictionary<TKey, TValue>...

The list may contain upto 10^12

Seriously? You do realise that even at one byte per item, with no
index overhead, padding, etc that's a TB?

I'm going to assume that is a typo - but even so, for large numbers
you may be better using a database approach, with a non-clustered
unique index on this field (only; don't span - let it use a bookmark
lookup to get the value) and ideally with the index on a different
file-group to the main data.

Marc

Jan 16 '08 #2

Pieter

Thansk both for your answer!

Actually: it's not a type: it could go theoretical upto 6*10^12 :-)
But as I said: I expect a practical implementation of +- 10^6...

I will use indeed SQL Server when the amoutn will be too big... But in case
it's underneath 10^6: Will a dictionary be better than a Hashtable? Or
soemthing else?

"Marc Gravell" <ma**********@gmail.comwrote in message
news:e0**********************************@h11g2000 prf.googlegroups.com...

>I need some type of array/list/... In which I can store objects together
with a unique key.
Sounds like Dictionary<TKey, TValue>...

>The list may contain upto 10^12
Seriously? You do realise that even at one byte per item, with no
index overhead, padding, etc that's a TB?

I'm going to assume that is a typo - but even so, for large numbers
you may be better using a database approach, with a non-clustered
unique index on this field (only; don't span - let it use a bookmark
lookup to get the value) and ideally with the index on a different
file-group to the main data.

Marc

Jan 16 '08 #3

=?Utf-8?B?UGV0ZXIgQnJvbWJlcmcgW0MjIE1WUF0=?=

Do you know how much memory 10^12 of these "items" is going to require?
Because if you run out of physical RAM, your whole process will either blow
up or slow down dramatically due to paging.
So the best alternative may be a database to hold your items. A search on a
table with a clustered index on a primary key in this case will be the
fastest.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #4

Pieter

Hehe I do know :-)
The problem is: it will be for an experiment, and not every possiblity will
happen as much as the others. So I want definetly put the most popular in
some kind of local cache... See it as putting the first block of a B-tree in
the RAM memory...

"Peter Bromberg [C# MVP]" <pb*******@yahoo.NoSpamMaam.comwrote in message
news:E0**********************************@microsof t.com...

Do you know how much memory 10^12 of these "items" is going to require?
Because if you run out of physical RAM, your whole process will either
blow
up or slow down dramatically due to paging.
So the best alternative may be a database to hold your items. A search on
a
table with a clustered index on a primary key in this case will be the
fastest.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:

>Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need
to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto
10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #5

=?ISO-8859-1?Q?Lasse_V=E5gs=E6ther_Karlsen?=

Pieter wrote:

Thansk both for your answer!

Actually: it's not a type: it could go theoretical upto 6*10^12 :-)
But as I said: I expect a practical implementation of +- 10^6...

I will use indeed SQL Server when the amoutn will be too big... But in case
it's underneath 10^6: Will a dictionary be better than a Hashtable? Or
soemthing else?

"Marc Gravell" <ma**********@gmail.comwrote in message
news:e0**********************************@h11g2000 prf.googlegroups.com...

>>I need some type of array/list/... In which I can store objects together
with a unique key.
Sounds like Dictionary<TKey, TValue>...

>>The list may contain upto 10^12
Seriously? You do realise that even at one byte per item, with no
index overhead, padding, etc that's a TB?

I'm going to assume that is a typo - but even so, for large numbers
you may be better using a database approach, with a non-clustered
unique index on this field (only; don't span - let it use a bookmark
lookup to get the value) and ideally with the index on a different
file-group to the main data.

Marc

I would carefully avoid solutions that require you to write the same
type of code multiple times, one type for < 10^6, one for <10^9 and one
for <10^12. If you need as many as 10^12, use a database.

If you absolutely want to draw the best performance out of everything,
abstract out the storage of this list to a new class, so that you can
inherit for it for an in-memory data structure, and inherit from it for
a database structure.

--
Lasse Vågsæther Karlsen
mailto:la***@vkarlsen.no
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3

Jan 16 '08 #6

Pieter

Thanks Lasse. What it actually will do is: every object will be another
possivble state. But some states will be much more needed than others. So
the 'popular' ones will be put in a local cache. But it's the structure of
that cache that I'm worrying about: As mayb 80% of the searching will happen
in there, it shoudl be as fast as possible: So: which structure to use?

"Lasse Vågsæther Karlsen" <la***@vkarlsen.nowrote in message
news:%2****************@TK2MSFTNGP04.phx.gbl...

I would carefully avoid solutions that require you to write the same type
of code multiple times, one type for < 10^6, one for <10^9 and one for
<10^12. If you need as many as 10^12, use a database.

If you absolutely want to draw the best performance out of everything,
abstract out the storage of this list to a new class, so that you can
inherit for it for an in-memory data structure, and inherit from it for a
database structure.

--
Lasse Vågsæther Karlsen
mailto:la***@vkarlsen.no
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3

Jan 16 '08 #7

Alberto Poblacion

"Pieter" <pi****************@hotmail.comwrote in message
news:OH****************@TK2MSFTNGP04.phx.gbl...

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

You will get the fastest performance with a Hash Table. Assuming that
you choose a good algorithm to assign the hash values, the hash table has
the advantage that the average number of accesses to the table to find a key
does not depend on the size of the table, but only on the percent of the
table that is full. With a table that is not terribly full (say 80 o 90%, I
don't remember the figures), the average number of accesses is one point
something. This beats a search on a B-Tree, which requires a number of
accesses that grows with the logarithm of the size of the table. Note that
I'm speaking about the *average* number of accesses. The *worst-case*
scenario is horribly bad, since it would require a full-scan of the hash
table. Fortunately, this is extremely improbable, on the condition that the
hashing algorithm and the collisions algorithm are properly chosen.

If you are going to deal with in-memory elemets using .Net, you can use
a Dictionary<key,value>, which will automatically use a hash table when the
number of stored elements is larger than some internally coded threshold.
You will need to use 64-bit code (and run it in a huge machine) if you want
to address 10^12 elements. If you need to store your data on disk, a
properly programmed hashing algorithm against a flat file can outperform a
database server, which uses trees to store its indices.

Jan 16 '08 #8

=?Utf-8?B?UGV0ZXIgQnJvbWJlcmcgW0MjIE1WUF0=?=

I'd think that a Generic List or Generic Dictionary would be faster as there
would be no boxing / unboxing involved in adding or retrieving typed elements.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #9

Cor Ligthert[MVP]

Pieter,

In past there was in this newsgroup somebody from Belgie active who was
always answering this question with.

Sorted List

http://msdn2.microsoft.com/en-us/lib...ortedlist.aspx

I have no expirience with that

Cor

Jan 16 '08 #10

Kevin Spencer

A Hashtable would be faster in terms of searching, as long as one was
searching by a key value.

--
HTH,

Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP

"Peter Bromberg [C# MVP]" <pb*******@yahoo.NoSpamMaam.comwrote in message
news:E9**********************************@microsof t.com...

I'd think that a Generic List or Generic Dictionary would be faster as
there
would be no boxing / unboxing involved in adding or retrieving typed
elements.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:

>Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need
to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto
10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #11

Marc Gravell

(of course, given the numbers in question, *any* in-memory approach is
doomed... doomed...)

[Cor Ligthert]

System.Collections.SortedList

&
[Kevin Spencer]

A Hashtable would be faster in terms of searching, as long as one was
searching by a key value.

Yes, a hash-table implementation is better than a flat list, but
Hashtable is the non-typed version; for the reasons already given by
Peter the generic versions would generally be preferable -
specifically one of Dictionary<TKey, TValue>, SortedDictionary<TKey,
TValueand SortedList<TKey, TValue(depending on what operations are
important). The OP indicates that *read* performance is the key here
(not insertion/deletion); these report O(1), O(log n) and O(log n) for
retrieval - hence Dictionary<TKey, TValuewould be my first choice.
But the real answer is to test it with typical data: if you code the
app to work against the IDictionary<TKey, TValueinterface (rather
than any concrete implementation) you should be able to try different
implementations with likely data.

Marc

Jan 16 '08 #12

Michel Posseth [MCP]

Hello Pieter

Balena did some tests and posted the results on his website , however with
the transition form VB2themax to dotnet2themax the article seems to be
disapeared
However the result was that the hashtable was and still is the fastest key
value pair object , the higher the amount of data was the more efiicient the
hashtable was in comparisation to other methods for sowell insertion as
retrieving of the key value pairs ( Balena timed both the time it took to
built up the data tree as the time it took to retrieve n elements ) ,
however with the hughe amounts you are talking about a database would be a
much bether idea
HTH

Michel

"Pieter" <pi****************@hotmail.comschreef in bericht
news:OH****************@TK2MSFTNGP04.phx.gbl...

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #13

Michel Posseth [MCP]

Found some numbers ( needed to do some translation )

The test was with 100.000 items

compared was ArrayList, HashTable en SortedList

Arraylist was 4 times faster as the hashtable , hashtable was 8 - 100 times
faster as the sortedlist

please if you write some comparisation routines yourself then don`t forget
to set compile options in VB.Net remove overflow checks and enable
optimizations
otherwise we get again trolling messages here that C# is so much faster as
VB.Net with these options the same as they are in C# they should evaluate at
the same speed

"Pieter" <pi****************@hotmail.comschreef in bericht
news:OH****************@TK2MSFTNGP04.phx.gbl...

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 16 '08 #14

Jon Skeet [C# MVP]

On Jan 16, 9:59 pm, "Michel Posseth [MCP]" <M...@posseth.comwrote:

Found some numbers ( needed to do some translation )

The test was with 100.000 items

compared was ArrayList, HashTable en SortedList

Arraylist was 4 times faster as the hashtable , hashtable was 8 - 100 times
faster as the sortedlist

It really surprises me that you found an ArrayList faster than a
Hashtable with that many items. I could understand it for a really
small dataset, but at 100,000 a hashtable should be much faster than a
list. Could you post a link to the code?

Jon

Jan 17 '08 #15

Marc Gravell

Arraylist was 4 times faster as the hashtable
The other question would be: "at what, exactly?"

If the timings include the insertion time, then yes: a hash-table
(Hashtable or Dictionary<TKey,TValue>) will take longer: it needs to
obtain hashes, build buckets, etc (an ArrayList or List<Tjust has to
copy an array a few times). But the *read* performance should then be
/significantly/ better, assuming that the items are being fetched by
key (and not by looping).

Without inspectable, reproducable code I'm going to take this with a
pinch of salt...

Marc

Jan 17 '08 #16

Michel Posseth [MCP]

Before anyone takes this out of perspective

1 . these tests were done by Balena not by me , however the results and code
are not annymore on the website , but i have read on a dutch website that
you can find this in his 2003 book ( i have the 2002 and 2005 version )

2 . The conclusion was that the hashtable was the fastest when values were
retrieved with the key , however a arraylist was faster when looping
through the values

"Although the statistics might tell you that the Amazone rivers average
depth is 50 cm this does not mean that you can walk to the other side"

Michel

"Marc Gravell" <ma**********@gmail.comschreef in bericht
news:%2****************@TK2MSFTNGP06.phx.gbl...

>Arraylist was 4 times faster as the hashtable
The other question would be: "at what, exactly?"

If the timings include the insertion time, then yes: a hash-table
(Hashtable or Dictionary<TKey,TValue>) will take longer: it needs to
obtain hashes, build buckets, etc (an ArrayList or List<Tjust has to
copy an array a few times). But the *read* performance should then be
/significantly/ better, assuming that the items are being fetched by key
(and not by looping).

Without inspectable, reproducable code I'm going to take this with a pinch
of salt...

Marc

Jan 17 '08 #17

Jon Skeet [C# MVP]

Michel Posseth [MCP] <MS**@posseth.comwrote:

Before anyone takes this out of perspective

1 . these tests were done by Balena not by me , however the results and code
are not annymore on the website , but i have read on a dutch website that
you can find this in his 2003 book ( i have the 2002 and 2005 version )

2 . The conclusion was that the hashtable was the fastest when values were
retrieved with the key , however a arraylist was faster when looping
through the values

Right. That makes a lot more sense. As far as I'm aware, this thread is
about looking up by key, at which point the hashtable should win
easily.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk

Jan 17 '08 #18

Michel Posseth [MCP]

IMHO

The Hashtable will not work for the TS as the object limit in .Net = 2 GB
the hashtable way will blow up with a out of memory exception wit the
amounts the TS wants to store

so the only way that will probably work is the also in this thread proposed
database aproach

Michel

"Jon Skeet [C# MVP]" <sk***@pobox.comschreef in bericht
news:MP*********************@msnews.microsoft.com. ..

Michel Posseth [MCP] <MS**@posseth.comwrote:
>Before anyone takes this out of perspective

1 . these tests were done by Balena not by me , however the results and
code
are not annymore on the website , but i have read on a dutch website that
you can find this in his 2003 book ( i have the 2002 and 2005
ersion )

2 . The conclusion was that the hashtable was the fastest when values
were
retrieved with the key , however a arraylist was faster when looping
through the values

Right. That makes a lot more sense. As far as I'm aware, this thread is
about looking up by key, at which point the hashtable should win
easily.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk

Jan 18 '08 #19

Bill McCarthy

"Jon Skeet [C# MVP]" <sk***@pobox.comwrote in message
news:1e**********************************@e23g2000 prf.googlegroups.com...

>
I've got a datastructure of 100 million records in memory. I probably
shouldn't say what it's for, but it basically ends up being a 10,000
squared lookup, with a single byte result.

Doing this in memory (for the low, low cost of 100M of memory) gave me
a ~100,000% performance improvement over a database lookup :)

That of course would not be a searchable datastructure, more like an indexed
array where the location is calculated by offset ;) If it was a key per
item, then each item would require at least a 4 byte key, making the in
memory size at least 500M

Jan 19 '08 #20

Alex Clark

I know it's not always good netiquette to pry into other's projects in these
newsgroups, but is anyone else really interested to know what this actual
application is for...?

"Pieter" <pi****************@hotmail.comwrote in message
news:OH****************@TK2MSFTNGP04.phx.gbl...

Hi,

I need some type of array/list/... In which I can store objects together
with a unique key. The most important thing is performance: I will need to
do the whole time searches in the list of given keys. Which datastructure
will be best suited for this? A Hashtable? The list may contain upto 10^12
items, bit more proably most of the time +- 10^9 of even 10^6...

Thanks a lot in advance,
Pieter

Jan 24 '08 #21

Pieter

Oh, it's good to have a healthy curiousity ;-)

It's a Reinforcement Learning Project: Artificial Intelligence.
I'm making an agent that can play a board game using RL. To do this, it must
learn from what it did:
- the agent comes in different states
- every state has different actions
- the agent must choose between these actions
- at the end he gets his reward (win or lose). this reward is given to the
actions he choose.
By learnign (exploration) and exploitation he must maximize his wins :-)

The whole datastructure is needed to put my states in: I must be able to
search very fast is he had alreaddy been in that state :-)

It has to be finished by monday evening ;-)

"Alex Clark" <al**@microsoft.comwrote in message
news:%2****************@TK2MSFTNGP05.phx.gbl...

>
I know it's not always good netiquette to pry into other's projects in
these newsgroups, but is anyone else really interested to know what this
actual application is for...?

Jan 24 '08 #22

fastest searchable datastructure?

Similar topics