Bytes IT Community

optimal load factor for small Hashtable

For a Hashtable that is expected to contain no more than 100 DictionaryEntry
elements, what is the best load factor?

(This Hashtable is encapsulated in a class, an instance of which is cached
and whose methods are called by n threads, where n is approximately 25.)
Mar 28 '06 #1
17 Replies


I'm going to hazard a guess that locking and synchronization are going
to use up more horsepower than the hashtable search itself. However,
given the tiny size of the table and the enormous amount of memory on
most machines, I would go for a lower load factor: trade memory for a
chance at better speed. Back when I used to write real-time systems we
used load factors of 0.5 as a general rule. However, with a maximum
somewhere near 100 entries, you can afford to go lower than that
without paying too much of a penalty in memory. Even if the cost of
lookup is dwarfed by the cost of synchronization, you haven't paid a
memory penalty worth worrying about, so who cares?

However, remember that the load factor is only as good as the hashing
algorithm. Make sure that the GetHashCode for your objects is
statistically solid, because all of the twiddling of load factor in the
world won't help you if several of your data items return the same hash
code.
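To make the memory-for-speed trade concrete, here is a small sketch. It is in Java, whose HashMap takes the same capacity/load-factor constructor pair as .NET's Hashtable; the helper name `presized` is my own, not part of either library:

```java
import java.util.HashMap;
import java.util.Map;

public class LoadFactorDemo {
    // Size the map so the expected entry count never triggers a resize:
    // capacity = expected / loadFactor, rounded up.
    public static <K, V> Map<K, V> presized(int expectedEntries, float loadFactor) {
        int capacity = (int) Math.ceil(expectedEntries / loadFactor);
        return new HashMap<>(capacity, loadFactor);
    }

    public static void main(String[] args) {
        // loadFactor 0.5 with 100 expected entries asks for ~200 buckets:
        // a trivial amount of memory, and collisions become rare.
        Map<String, String> table = presized(100, 0.5f);
        for (int i = 0; i < 100; i++) {
            table.put("key" + i, "value" + i);
        }
        System.out.println(table.size()); // 100
    }
}
```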

Mar 28 '06 #2

Hi Bruce, and thanks for the response.

My objects are not large. I am wondering if load factor even matters at
all.

Is it the case that the load factor most prominently comes into play when
very large numbers of items are constantly added to and removed from the
hashtable?

"Bruce Wood" <br*******@canada.com> wrote in message
news:11**********************@v46g2000cwv.googlegr oups.com...
I'm going to hazard a guess that locking and synchronization are going
to use up more horsepower than the hashtable search itself. However,
given the tiny size of the table and the enormous amount of memory on
most machines, I would go for a lower load factor: trade memory for a
chance at better speed. Back when I used to write real-time systems we
used load factors of 0.5 as a general rule. However, with a maximum
somewhere near 100 entries, you can afford to go lower than that
without paying too much of a penalty in memory. Even if the cost of
lookup is dwarfed by the cost of synchronization, you haven't paid a
memory penalty worth worrying about, so who cares?

However, remember that the load factor is only as good as the hashing
algorithm. Make sure that the GetHashCode for your objects is
statistically solid, because all of the twiddling of load factor in the
world won't help you if several of your data items return the same hash
code.

Mar 29 '06 #3

Load factor represents your statistical chances of having more than one
item in a hash bucket. Note that by "item" here I mean a reference, or
pointer to an item, so the size of your items doesn't matter: the same
thing is always stored in the Hashtable: a pointer.

I'm guessing that you know how a Hashtable is implemented, but in case
you don't, here's a quick rundown. A hash table is nothing more than an
array of lists. Each entry in the array is called a "bucket", and a
bucket can hold zero, one, or more of your items. The way it works is
that your class's GetHashCode method returns a hash code value, which
then goes through some simple manipulation and ends up determining
which bucket in the table the item will be filed under. If more than
one item ends up in the same bucket, they are chained into a list.
From the documentation, it appears that a load factor of 1 means one
Hashtable bucket for each expected item in the table. So, if you
expected 100 items, and you created a Hashtable with a load factor of
1, you would get 100 buckets in the table.

So, that sounds fine: one bucket per item. However, remember that which
item ends up in which bucket depends upon statistical mathematics
tricks, so you could very well have two items in one bucket and one
bucket empty, depending upon how your hashes work out. Two items in one
bucket means that you have to search down that chain looking for which
one is the one you really want. Two items isn't really a big deal, but
if that number gets too large, you start doing a lot of searching,
which takes time.

So, one solution is to lower the load factor. If you have two buckets
for every expected item, then your chances of getting two items in the
same bucket, and therefore having to search, just went down. Three
buckets for every expected item and the chances drop even more. If you
lower the load factor enough, your chances of getting two or more items
in one bucket become very low indeed.

(That is, assuming that you have a good hash function. If you have a
stupid hash function with a bad distribution, you can have 100 buckets
for each expected item and still get multiple items in the same
bucket.)

So, the size of your items doesn't matter, and how often you add and
remove doesn't matter. All the load factor does is change your chances
of having to search down a chain for the item you want.
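The bucket mechanics described above can be sketched in a few lines. This is an illustrative Java toy (the method names are mine), not the real Hashtable implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BucketDemo {
    // A hash table is an array of "buckets"; each bucket chains the
    // items whose hash codes map to the same slot.
    public static int bucketIndex(Object key, int bucketCount) {
        // Mask off the sign bit so the index is never negative.
        return (key.hashCode() & 0x7fffffff) % bucketCount;
    }

    // Worst-case chain length for a given key set and bucket count.
    public static int longestChain(List<String> keys, int bucketCount) {
        int[] counts = new int[bucketCount];
        int worst = 0;
        for (String k : keys) {
            int i = bucketIndex(k, bucketCount);
            counts[i]++;
            worst = Math.max(worst, counts[i]);
        }
        return worst;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 100; i++) keys.add("product-" + i);
        // More buckets per item => shorter chains, fewer probes per lookup.
        System.out.println(longestChain(keys, 100));
        System.out.println(longestChain(keys, 300));
    }
}
```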

Mar 29 '06 #4

John A Grandy wrote:
For a Hashtable that is expected to contain no more than 100 DictionaryEntry
elements, what is the best load factor?


Why are you worrying about this? Are you experiencing performance or
memory problems?

If you are worried about hash-collisions, just allocate a large
dictionary to begin with:

new Hashtable(100*3);

--
Helge Jensen
mailto:he**********@slog.dk
sip:he**********@slog.dk
-=> Sebastian cover-music: http://ungdomshus.nu <=-
Mar 29 '06 #5

Great explanation !

In my case, at app start I populate the Hashtable from an XML file stored in
RAM. The XML file rarely changes.

The Hashtable is read-only. It lives in an instance of a class cached on a
multi-processor server, and this instance is shared by many threads (they
grab the cached class instance and then use an IDictionaryEnumerator to
iterate through the hashtable items, performing comparisons).

So, given that my data store is read-only, 100 items max, I am now wondering
if a Hashtable is the most appropriate data structure...

"Bruce Wood" <br*******@canada.com> wrote in message
news:11*********************@e56g2000cwe.googlegro ups.com...
Load factor represents your statistical chances of having more than one
item in a hash bucket. Note that by "item" here I mean a reference, or
pointer to an item, so the size of your items doesn't matter: the same
thing is always stored in the Hashtable: a pointer.

I'm guessing that you know how a Hashtable is implemented, but in case
you don't, here's a quick rundown. A hash table is nothing more than an
array of lists. Each entry in the array is called a "bucket", and a
bucket can hold zero, one, or more of your items. The way it works is
that your class's GetHashCode method returns a hash code value, which
then goes through some simple manipulation and ends up determining
which bucket in the table the item will be filed under. If more than
one item ends up in the same bucket, they are chained into a list.
From the documentation, it appears that a load factor of 1 means one

Hashtable bucket for each expected item in the table. So, if you
expected 100 items, and you created a Hashtable with a load factor of
1, you would get 100 buckets in the table.

So, that sounds fine: one bucket per item. However, remember that which
item ends up in which bucket depends upon statistical mathematics
tricks, so you could very well have two items in one bucket and one
bucket empty, depending upon how your hashes work out. Two items in one
bucket means that you have to search down that chain looking for which
one is the one you really want. Two items isn't really a big deal, but
if that number gets too large, you start doing a lot of searching,
which takes time.

So, one solution is to lower the load factor. If you have two buckets
for every expected item, then your chances of getting two items in the
same bucket, and therefore having to search, just went down. Three
buckets for every expected item and the chances drop even more. If you
lower the load factor enough, your chances of getting two or more items
in one bucket become very low indeed.

(That is, assuming that you have a good hash function. If you have a
stupid hash function with a bad distribution, you can have 100 buckets
for each expected item and still get multiple items in the same
bucket.)

So, the size of your items doesn't matter, and how often you add and
remove doesn't matter. All the load factor does is change your chances
of having to search down a chain for the item you want.

Mar 29 '06 #6

Are the clients always iterating over the table? If so, then you
definitely don't want a Hashtable.

What kinds of comparisons are they doing? Are they filtering the
entries, looking for subsets, or doing something else?

You say that the data in the hash table is "rarely updated". Under what
circumstances is it updated, and in what way (entries altered / added /
removed)?

If you give me more details maybe I can suggest a more appropriate data
structure.

Mar 29 '06 #7

Oh, sorry. Ignore the question about how the hash table changes. You've
already stated that it's read-only.

I'm interested in how it's used, though. I can probably suggest a
better structure.

Oh, and locking shouldn't be required if the structure is read-only.

Mar 29 '06 #8

In terms of the business process, the Hashtable's keys and values are
strings.

Product name strings are examined to determine if they contain one or more
of the string keys in the Hashtable (this is where the Hashtable is
iterated). For each key so found, the corresponding value is added to an
ArrayList containing objects of a custom type (ProductAttribute).

The Hashtable is loaded from a read-only xml file into a cached instance of
an "umbrella" cache class. The xml file might change approx once a month.
It's unlikely to ever contain more than 100 elements. Each time it's
changed the Windows Service that houses this entire process must be
restarted to re-instantiate/re-load the cache instance.

So ... it appears that the only performance-relevant aspect is that many
threads (at least 25) may be running simultaneously, each grabbing the
cached Hashtable instance whenever it processes a product item.

"Bruce Wood" <br*******@canada.com> wrote in message
news:11**********************@i39g2000cwa.googlegr oups.com...
Oh, sorry. Ignore the question about how the hash table changes. You've
already stated that it's read-only.

I'm interested in how it's used, though. I can probably suggest a
better structure.

Oh, and locking shouldn't be required if the structure is read-only.

Mar 30 '06 #9

> Product name strings are examined to determine if they contain one or more of the string keys in the Hashtable...

How is this examination done? How is the product name formed? What I'm
getting at is whether the product name has some delimiter that you
could use to break it up before checking, or whether the table keys can
appear in any subset of any word of the product name (where "words" may
be delimited by spaces, dashes, or whatever)?

Is there any way that you could break up the product name and search
for the bits in the Hashtable, rather than iterating over the table
checking for each key in the product name?
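If the names do split cleanly, the probe-per-token idea might look roughly like this. It is a hypothetical Java sketch, not your actual code: the method name, delimiter set, and string values are all assumptions:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AttributeLookup {
    // Split the product name on delimiters and probe each token against
    // the table directly: O(1) per token, instead of scanning every table
    // key against the whole product name.
    public static List<String> attributesFor(String productName,
                                             Map<String, String> attributeTable) {
        List<String> found = new ArrayList<>();
        for (String token : productName.split("[\\s-]+")) {
            String attr = attributeTable.get(token);
            if (attr != null) {
                found.add(attr);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Map<String, String> table = new HashMap<>();
        table.put("XL", "Size: extra large");
        table.put("Red", "Colour: red");
        System.out.println(attributesFor("Widget Red XL Deluxe", table));
    }
}
```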

As well, is your ProductAttribute class immutable? Could you
instantiate all of the ProductAttribute objects once when the service
starts up and then never make another one? Or do they contain
information that changes for each product and so you need many
instances of the same ProductAttribute?

As for changes to the XML file, you could use FileSystemWatcher to
respond to a change in the file and reload it while the system is
running. That would remove the need to stop and restart the service
when the file changes. (Hey, one less thing that IT staff have to do in
order to make the system work properly is a money saver. This is
particularly true since the XML file changes rarely, and therefore
they're more likely to forget that they have to stop and restart the
service and spend time ( = $ ) tracking down the "bug" that results
from the XML file not being current in memory.)

It should be possible to rebuild whatever data structure you're using
from the XML file, then swap the (static) pointer to the list of
product attributes in one clean, atomic operation, so that searches
that are in progress on the old file contents will proceed on the old
structure, but searches that are about to start will grab the reference
to the new structure.

This, in turn, should mean (in theory) that you should need no locking
at all on the structure. I'm no expert in multithreading... I'd rather
have someone like Jon Skeet comment on this, but if you're not altering
the data structure at all, only reading it, then you shouldn't need to
lock on it. There would be only two operations that would change the
structure:

1) .NET's ability to move things about in memory as your code is
running. However, this should be transparent, and so shouldn't be
affected by locking.
2) Wholesale replacement of the structure when the XML file changes.
However, if it's done right, it should not require locking.

That means that you can have multiple threads reading the structure
simultaneously and they shouldn't conflict in any way.
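The swap-one-atomic-reference idea can be sketched like so. This is a Java version (class and method names are mine); it uses a volatile field so the fully built table is safely published to reader threads:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class SwappableTable {
    // A single volatile reference: readers always see either the old
    // complete table or the new complete table, never a half-built one.
    private static volatile Map<String, String> table = Collections.emptyMap();

    // Callers grab the reference once and use that snapshot throughout;
    // an in-flight search keeps its old snapshot even across a swap.
    public static Map<String, String> current() {
        return table;
    }

    // Build the new table fully, then publish it in one atomic write.
    public static void reload(Map<String, String> freshlyBuilt) {
        table = Collections.unmodifiableMap(new HashMap<>(freshlyBuilt));
    }

    public static void main(String[] args) {
        Map<String, String> v1 = new HashMap<>();
        v1.put("XL", "extra large");
        reload(v1);
        System.out.println(current().get("XL")); // extra large
    }
}
```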

Once I figure out better exactly what you're doing, I can tell you what
"the structure" should be. :-)

Mar 30 '06 #10

Bruce Wood <br*******@canada.com> wrote:

<snip>
This, in turn, should mean (in theory) that you should need no locking
at all on the structure. I'm no expert in multithreading... I'd rather
have someone like Jon Skeet comment on this, but if you're not altering
the data structure at all, only reading it, then you shouldn't need to
lock on it. There would be only two operations that would change the
structure:


In theory, if the reading threads have been created before the data has
all been created, you should have a memory barrier in both the creating
thread and the reading thread, otherwise there's no guarantee the
reading thread will actually see the data.

In practice, it's very rarely an issue, but it's something to be aware
of.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #11

What about swapping the whole structure on the fly, Jon? I suppose that
if the reading threads were to cache the root pointer to the data
structure, and the updating thread were then to swap it out, the
reading threads may never see the new pointer and keep trying to read
the original structure.

Come to think of it, if each thread held the lock only long enough to
get the reference to the data structure, then released the lock,
contention should be almost non-existent and that would solve the
caching problem, no?

My first job was working on a multi-tasking, shared-memory embedded
platform. I eventually went on to other things because multitasking
makes my head hurt. It still makes my head hurt. ;-)

Mar 31 '06 #12

Bruce Wood <br*******@canada.com> wrote:
<snip>

Come to think of it, if each thread held the lock only long enough to
get the reference to the data structure, then released the lock,
contention should be almost non-existent and that would solve the
caching problem, no?


Absolutely - that would definitely do it. Of course, you could very
temporarily be using an "old" version:

1) Reading thread acquires lock, gets reference, releases lock
2) Writing thread acquires lock, changes reference, releases lock
3) Reading thread uses reference it fetched in step 1.

However, that's unlikely to be an issue.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
If replying to the group, please do not mail me too
Mar 31 '06 #13

> Of course, you could very temporarily be using an "old" version.

Agreed. However, if you never alter the structure itself, just swap it
for a new one, then any threads that grabbed the old data structure's
reference before the swap will still be working with a complete, intact
structure, and the thing won't be GC'd until the last thread goes back
to get a reference to the structure again. So, there could be a few
seconds during the switchover when some threads are searching the new
structure and some are searching the old. Given the original
requirements, that doesn't sound like a big deal, but I'll let John
Grandy be the final judge of that. :-)

So, it should all hang together.

So far as the structure itself, I'm leaning toward:

1) If there is some way to break up the product code and then search
for each bit to see if it is a product attribute, then a Hashtable.

2) If there is no alternative but to search through a collection of
attributes, testing each one against the product code, then an
ArrayList of custom objects, each of which contains a key, a value, and
the code to do the comparison against the product code.
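Option 2 might look roughly like this Java sketch; the `Rule` type and the simple `contains` test are placeholders for the real ProductAttribute comparison logic:

```java
import java.util.ArrayList;
import java.util.List;

public class AttributeMatcher {
    // When a key can appear anywhere inside the product name, a hash
    // lookup buys nothing; a plain list of key/value rules scanned
    // linearly is simpler and, at <= 100 entries, plenty fast.
    public static class Rule {
        final String key;
        final String value;

        public Rule(String key, String value) {
            this.key = key;
            this.value = value;
        }

        boolean matches(String productName) {
            return productName.contains(key);
        }
    }

    public static List<String> attributesFor(String productName, List<Rule> rules) {
        List<String> found = new ArrayList<>();
        for (Rule r : rules) {
            if (r.matches(productName)) {
                found.add(r.value);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("XL", "Size: extra large"));
        System.out.println(attributesFor("WidgetXL99", rules));
    }
}
```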

I'm waiting on John Grandy for more details, though... John?

Mar 31 '06 #14

I think my problem is much simpler than the one you guys are discussing.

My source data for the Hashtable (or other data structure) is an XML file on
disk that is manually updated. It is no problem to stop/restart the Windows
Service whenever this XML file needs to be updated (it's on the order of
once/week).

"Bruce Wood" <br*******@canada.com> wrote in message
news:11**********************@i39g2000cwa.googlegr oups.com...
Of course, you could very temporarily be using an "old" version.


Agreed. However, if you never alter the structure itself, just swap it
for a new one, then any threads that grabbed the old data structure's
reference before the swap will still be working with a complete, intact
structure, and the thing won't be GC'd until the last thread goes back
to get a reference to the structure again. So, there could be a few
seconds during the switchover when some threads are searching the new
structure and some are searching the old. Given the original
requirements, that doesn't sound like a big deal, but I'll let John
Grandy be the final judge of that. :-)

So, it should all hang together.

So far as the structure itself, I'm leaning toward:

1) If there is some way to break up the product code and then search
for each bit to see if it is a product attribute, then a Hashtable.

2) If there is no alternative but to search through a collection of
attributes, testing each one against the product code, then an
ArrayList of custom objects, each of which contains a key, a value, and
the code to do the comparison against the product code.

I'm waiting on John Grandy for more details, though... John?

Mar 31 '06 #15

Oh, I know. I just couldn't resist the temptation to automate it. :-)

Do you have time (and the inclination) to answer the other questions I
had about your data... then I can give you an idea of which data
structure would be most appropriate.

Mar 31 '06 #16

Sorry, I got sidetracked ... which questions didn't I answer?

"Bruce Wood" <br*******@canada.com> wrote in message
news:11**********************@i40g2000cwc.googlegr oups.com...
Oh, I know. I just couldn't resist the temptation to automate it. :-)

Do you have time (and the inclination) to answer the other questions I
had about your data... then I can give you an idea of which data
structure would be most appropriate.

Apr 4 '06 #17

That's OK... I did too. :-)

Here's a repost:
Product name strings are examined to determine if they contain one or more of the string keys in the Hashtable...


How is this examination done? How is the product name formed? What I'm
getting at is whether the product name has some delimiter that you
could use to break it up before checking, or whether the table keys can
appear in any subset of any word of the product name (where "words" may
be delimited by spaces, dashes, or whatever)?

Is there any way that you could break up the product name and search
for the bits in the Hashtable, rather than iterating over the table
checking for each key in the product name?

As well, is your ProductAttribute class immutable? Could you
instantiate all of the ProductAttribute objects once when the service
starts up and then never make another one? Or do they contain
information that changes for each product and so you need many
instances of the same ProductAttribute?

Apr 5 '06 #18

This discussion thread is closed. Replies have been disabled for this discussion.