hashtable and casing

William Stacey [MVP]

I'm working on a project that makes heavy use of domain names such as
"www.yahoo.com."
Domain names preserve case, but two names are considered equal if they
differ only in case.
I know I can store these names as keys in a Hashtable with a case-insensitive
comparer and CaseInsensitiveHashCodeProvider. That works fine. The benefit
is that I can store the domain name without worrying about case and return
the user-supplied case without storing any extra state. However, this comes
at a cost, because all string compare operations (EndsWith, etc.) must now be
case insensitive. If the names were all lower case, for example, string
compares would be very fast, and if interned, really fast. I could store the
domain name as all lower case and then store a BitArray that tells me which
chars were upper case. However, that seems like a pain and still requires at
least 32 bytes for a 255-char domain name, or 3 bytes for a 20-char name. I
could also store both the original case as a string and a lower-case version
that is used for all compare, EndsWith, hash, etc. operations. However, this
doubles the storage needed, which can be mitigated with string interning for
duplicates. That is my most attractive option in terms of performance, I
think, but I was wondering what others think. Cheers!
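
For readers following along, the lower-case-plus-BitArray idea above can be sketched roughly as follows. This is only an illustration, not code from the thread: the class name `CaseMaskedName` and its members are hypothetical, and it assumes names are plain ASCII, so only 'A'..'Z' are folded.

```csharp
using System;
using System.Collections;

// Hypothetical helper: keeps a lower-cased name for comparisons plus a bit mask
// recording which characters were upper case, so the original can be rebuilt.
public sealed class CaseMaskedName
{
    private readonly BitArray upperMask;

    // Lower-cased form, intended for Equals/EndsWith/hash operations.
    public string Lower { get; }

    public CaseMaskedName(string original)
    {
        if (original == null) throw new ArgumentNullException(nameof(original));

        char[] chars = original.ToCharArray();
        upperMask = new BitArray(chars.Length); // all bits start as false
        for (int i = 0; i < chars.Length; i++)
        {
            if (chars[i] >= 'A' && chars[i] <= 'Z')
            {
                upperMask[i] = true;             // remember this position was upper case
                chars[i] = (char)(chars[i] + 32); // fold ASCII upper to lower
            }
        }
        Lower = new string(chars);
    }

    // Size of the mask payload alone: one bit per character, rounded up to bytes.
    // (The BitArray object itself adds some overhead on top of this.)
    public int MaskBytes => (upperMask.Length + 7) / 8;

    // Rebuilds the name in the case it was originally supplied.
    public string Restore()
    {
        char[] chars = Lower.ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            if (upperMask[i]) chars[i] = (char)(chars[i] - 32);
        }
        return new string(chars);
    }
}
```

With this sketch, `new CaseMaskedName("WWW.Yahoo.COM.").Lower` is the folded name and `Restore()` gives back the supplied case. `MaskBytes` reproduces the arithmetic in the post: 32 bytes for a 255-char name and 3 bytes for a 20-char name.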

--
William Stacey, MVP

Nov 16 '05 #1

Justin Rogers

Go back a couple of days and look up that IndexOf I helped a user with.
You should probably write your own custom string operations for this one
that give you maximum speed, with the trade-off that you won't be culture
aware. In this case, your case-insensitive work does not need to be culture
aware; it simply has to follow the RFCs for domain names, which are fairly
strict.
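
A minimal sketch of what such a custom, non-culture-aware operation might look like is given here for illustration. It is not the IndexOf code referred to above; the names `DomainText`, `FoldAscii` and `EndsWithIgnoreAsciiCase` are hypothetical, and it assumes only ASCII letters need folding.

```csharp
using System;

// Hypothetical helper: ASCII-only, culture-independent suffix test.
public static class DomainText
{
    // Fold 'A'..'Z' to 'a'..'z'; every other character is left unchanged.
    private static char FoldAscii(char c)
    {
        return (c >= 'A' && c <= 'Z') ? (char)(c + 32) : c;
    }

    // True when 'name' ends with 'suffix', ignoring ASCII case only.
    public static bool EndsWithIgnoreAsciiCase(string name, string suffix)
    {
        if (name == null) throw new ArgumentNullException(nameof(name));
        if (suffix == null) throw new ArgumentNullException(nameof(suffix));
        if (suffix.Length > name.Length) return false;

        // Compare the tail of 'name' against 'suffix' character by character.
        int offset = name.Length - suffix.Length;
        for (int i = 0; i < suffix.Length; i++)
        {
            if (FoldAscii(name[offset + i]) != FoldAscii(suffix[i])) return false;
        }
        return true;
    }
}
```

For example, `DomainText.EndsWithIgnoreAsciiCase("WWW.Yahoo.COM.", "yahoo.com.")` is true. On .NET 2.0 and later, `name.EndsWith(suffix, StringComparison.OrdinalIgnoreCase)` is a built-in alternative that also avoids culture rules, though it folds more than just ASCII.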
--
Justin Rogers
DigiTec Web Consultants, LLC.
Blog: http://weblogs.asp.net/justin_rogers

Nov 16 '05 #2

John Wood

Why do you have to return the domain to the user in the case it was
entered? It would actually make more sense to correct it to lower case, I
think, because all domains are in lower case and anything else they enter is
probably mistyped.

You could even correct the domain by doing a reverse DNS lookup and storing
the result.

Nov 16 '05 #3

William Stacey [MVP]

> think, because all domains are in lower case and anything else they enter is
> probably mistyped).

That would be nice, but it is not allowed by RFCs 1034 and 1035. You must
preserve the case of domain names and labels.

Nov 16 '05 #4

John Wood

but but... why?

Internet Explorer doesn't, for a start. Enter a URL in messed-up case, and
it'll correct the domain et al when it displays the page.

Nov 16 '05 #5

William Stacey [MVP]

That is IE (the application) that downcases it. The resolver and the DNS
servers preserve the case that was entered when the RR was created. Utilities
like dig and nslookup can be used to see that case is preserved. IE should
not be the benchmark in this case.

--
William Stacey, MVP

Nov 16 '05 #6

John Wood

Well, I'm not saying that's the wrong thing to do... I'm just interested in
why it's so important. Surely it's more important to reflect the intent of
the company/person hosting the site than that of the person who entered the
site name?

That's a bit like someone mispronouncing your name, and you continuing that
mispronunciation, rather than either correcting them, or ignoring them and
continuing with the correct pronunciation.

Nov 16 '05 #7

William Stacey [MVP]

If you do an AXFR, for example, you will see all the RRs in the zone in the
case you entered them.
If you do "dig abc.test.com", the server will return "abc.test.com" even if
the case on the server is "ABC.test.com."
If you do "dig abC.test.com", the server will return "abC.test.com", i.e. the
same case as your question. The match is case insensitive. I'm not sure how
to comment other than that this is how it works currently. I think the
important point is that the server maintains case but does case-insensitive
matching, so it does not matter what case the QName is sent in. Cheers,
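
The "maintains case, matches case-insensitively" behaviour described above can be illustrated with a small sketch. It is not how any real DNS server is written. `ToyZone` and its members are hypothetical names, and `Dictionary` with `StringComparer.OrdinalIgnoreCase` stands in for the Hashtable plus CaseInsensitiveHashCodeProvider setup discussed in this thread.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy zone: owner names are stored in the case they were entered,
// while lookups ignore case.
public sealed class ToyZone
{
    private readonly Dictionary<string, string> records =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // Adds a record; throws if a name differing only in case already exists.
    public void Add(string ownerName, string data)
    {
        records.Add(ownerName, data);
    }

    // Case-insensitive match on the query name.
    public bool TryResolve(string queryName, out string data)
    {
        return records.TryGetValue(queryName, out data);
    }

    // Enumerating the zone shows names in their stored case.
    public IEnumerable<string> OwnerNames
    {
        get { return records.Keys; }
    }
}
```

After `zone.Add("ABC.test.com.", "10.0.0.1")`, a call to `zone.TryResolve("abc.test.com.", out data)` succeeds, while `OwnerNames` still yields "ABC.test.com." exactly as entered. The address is made-up sample data.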

--
William Stacey, MVP

Nov 16 '05 #8

Kevin Yu [MSFT]

Hi William,

First of all, I would like to confirm my understanding of your issue. From
your description, I understand that you need to know the best way to do
case-insensitive comparison and storage. If I have misunderstood anything,
please feel free to let me know.

As far as I know, it's hard to optimize both time complexity and space
complexity. When we gain performance, we usually give up storage space.
When we use less memory, we usually get worse performance.

So I think whether to favor time or space depends on the project. When the
server is very fast and not many users are accessing the service
simultaneously, we can save the URL in its original case and compare with
CaseInsensitiveHashCodeProvider. If the users are doing comparisons
frequently, we can try saving the text in two versions and comparing against
the lower-cased version.
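
As a rough illustration of the second option (keeping two versions of the text), a sketch might look like this. The type names `DomainEntry` and `DomainTable` are hypothetical, and interning the lower-cased copy is an assumption borrowed from the original question rather than something this reply prescribes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical entry holding both forms of a name.
public sealed class DomainEntry
{
    public string Original { get; } // case as entered, returned to the user
    public string Lower { get; }    // used for Equals, EndsWith and hashing

    public DomainEntry(string original)
    {
        if (original == null) throw new ArgumentNullException(nameof(original));
        Original = original;
        // Interning lets duplicate lower-cased names share one string instance.
        Lower = string.Intern(original.ToLowerInvariant());
    }
}

// Hypothetical table keyed by the lower-cased form with ordinal comparison.
public sealed class DomainTable
{
    private readonly Dictionary<string, DomainEntry> byLower =
        new Dictionary<string, DomainEntry>(StringComparer.Ordinal);

    public void Add(string name)
    {
        DomainEntry entry = new DomainEntry(name);
        byLower[entry.Lower] = entry;
    }

    // Returns the stored entry, or null when the name is unknown.
    // Note that every lookup pays for lower-casing the query first.
    public DomainEntry Find(string name)
    {
        DomainEntry entry;
        return byLower.TryGetValue(name.ToLowerInvariant(), out entry) ? entry : null;
    }
}
```

The trade-off is visible in the sketch: comparisons on `Lower` are plain ordinal operations, but each name is held twice and each query has to be lower-cased before the lookup.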

HTH. If anything is unclear, please feel free to reply to the post.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 16 '05 #9

William Stacey [MVP]

Agreed. I decided on the case-insensitive hash code provider and storing the
domain names as entered, in whatever case they are. This does mean I can't
intern them and do quick Object.ReferenceEquals testing or simple
String.Equals testing. However, the interning approach would require
downcasing each request (slow) plus the time to intern the string or get its
intern pool reference, and those two steps take more time than the
case-insensitive lookup. Hope you get what I mean. Cheers!

--
William Stacey, MVP

Nov 16 '05 #10

Kevin Yu [MSFT]

Hi William,

I'm glad to know that you have the problem resolved. Thanks for
sharing your experience with all the people here. If you have any
questions, please feel free to post them in the community.

Kevin Yu
=======
"This posting is provided "AS IS" with no warranties, and confers no
rights."

Nov 16 '05 #11
