By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
435,136 Members | 1,212 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 435,136 IT Pros & Developers. It's quick & easy.

text decoding from dataset, hmm... help appreciated.

P: n/a
Hi all,

Here is my problem:

I have a SQL Server 2000 DB with various NVarChar, NText fields in its
tables.
For some stupid reason the data was inserted into these fields in UTF8
encoding.

However when you retrieve these values into a dataset and ToString() them
some
characters come out as garbage.

So therefore I have started writing a throw away app that will go through
all the
relevant tables and fields decoding and then updating with unicode values.

However I'm a bit confused and stuck. There seem to be lots of classes that
sound exactly like what I want however they either don't convert to unicode
or they expect the data to be in a byte array. Trying to convert the dataset
values from object to a byte array causes a invalid cast exception.

Any ideas or links to detail info about the encoding stuff would be much
appreciated.

Regards,
Peter
Jul 21 '05 #1
Share this Question
Share on Google+
40 Replies


P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Here is my problem:

I have a SQL Server 2000 DB with various NVarChar, NText fields in its
tables.
For some stupid reason the data was inserted into these fields in UTF8
encoding.
That shouldn't make any odds - the encoding shouldn't matter at all, as
what will end up in the database is unicode characters.
However when you retrieve these values into a dataset and ToString() them
some characters come out as garbage.

So therefore I have started writing a throw away app that will go through
all the
relevant tables and fields decoding and then updating with unicode values.

However I'm a bit confused and stuck. There seem to be lots of classes that
sound exactly like what I want however they either don't convert to unicode
or they expect the data to be in a byte array. Trying to convert the dataset
values from object to a byte array causes a invalid cast exception.

Any ideas or links to detail info about the encoding stuff would be much
appreciated.


See http://www.pobox.com/~skeet/csharp/unicode.html

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #2

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Here is my problem:

I have a SQL Server 2000 DB with various NVarChar, NText fields in its
tables.
For some stupid reason the data was inserted into these fields in UTF8
encoding.
That shouldn't make any odds - the encoding shouldn't matter at all, as
what will end up in the database is unicode characters.
However when you retrieve these values into a dataset and ToString() them
some characters come out as garbage.

So therefore I have started writing a throw away app that will go through
all the
relevant tables and fields decoding and then updating with unicode values.

However I'm a bit confused and stuck. There seem to be lots of classes that
sound exactly like what I want however they either don't convert to unicode
or they expect the data to be in a byte array. Trying to convert the dataset
values from object to a byte array causes a invalid cast exception.

Any ideas or links to detail info about the encoding stuff would be much
appreciated.


See http://www.pobox.com/~skeet/csharp/unicode.html

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #3

P: n/a
Hi Jon,

I told Peter in the language.vb group to ask this in here, because there was
the expert Jon Skeet, I think you can better.

(Now my advise looks as nothing and maybe I do it myself although I am not
that expert of course)
Cor
Jul 21 '05 #4

P: n/a
Hi Jon,

I told Peter in the language.vb group to ask this in here, because there was
the expert Jon Skeet, I think you can better.

(Now my advise looks as nothing and maybe I do it myself although I am not
that expert of course)
Cor
Jul 21 '05 #5

P: n/a
Cor Ligthert <no**********@planet.nl> wrote:
I told Peter in the language.vb group to ask this in here, because there was
the expert Jon Skeet, I think you can better.

(Now my advise looks as nothing and maybe I do it myself although I am not
that expert of course)


Once Peter has read through the article I referenced, he should
understand things better - so he'll be able to come back with more
detailed questions, hopefully.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #6

P: n/a
Cor Ligthert <no**********@planet.nl> wrote:
I told Peter in the language.vb group to ask this in here, because there was
the expert Jon Skeet, I think you can better.

(Now my advise looks as nothing and maybe I do it myself although I am not
that expert of course)


Once Peter has read through the article I referenced, he should
understand things better - so he'll be able to come back with more
detailed questions, hopefully.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #7

P: n/a
Hi,

Well regardless of "it shouldn't make any odds" it DOES.

One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along
and on certain characters causes a problem.

Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get
2 or 3 nonsense characters.

Anyhow after many hours of frustration and trying to understand all the
..NET encoding classes I discovered that it is not possible to do what I want
with native .NET code.

This is because when using a dataset the values have already been implicitly
converted
and inaddtion you have to use ToString() any way, by this point the UTF8
value
has been corrupted and the decode to Unicode doesn't work.

Using a datareader and it's GetBytes() method doesn't help either because
that
method only works on Text and NText database types and I need it to work on
NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away
everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that
the
VB6 version used have been "fixed" to proper unicode and not utf8 this util
will never
be used again.

Regards,
Peter

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Here is my problem:

I have a SQL Server 2000 DB with various NVarChar, NText fields in its
tables.
For some stupid reason the data was inserted into these fields in UTF8
encoding.


That shouldn't make any odds - the encoding shouldn't matter at all, as
what will end up in the database is unicode characters.
However when you retrieve these values into a dataset and ToString() them some characters come out as garbage.

So therefore I have started writing a throw away app that will go through all the
relevant tables and fields decoding and then updating with unicode values.
However I'm a bit confused and stuck. There seem to be lots of classes that sound exactly like what I want however they either don't convert to unicode or they expect the data to be in a byte array. Trying to convert the dataset values from object to a byte array causes a invalid cast exception.

Any ideas or links to detail info about the encoding stuff would be much
appreciated.


See http://www.pobox.com/~skeet/csharp/unicode.html

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #8

P: n/a
Hi,

Well regardless of "it shouldn't make any odds" it DOES.

One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along
and on certain characters causes a problem.

Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get
2 or 3 nonsense characters.

Anyhow after many hours of frustration and trying to understand all the
..NET encoding classes I discovered that it is not possible to do what I want
with native .NET code.

This is because when using a dataset the values have already been implicitly
converted
and inaddtion you have to use ToString() any way, by this point the UTF8
value
has been corrupted and the decode to Unicode doesn't work.

Using a datareader and it's GetBytes() method doesn't help either because
that
method only works on Text and NText database types and I need it to work on
NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away
everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that
the
VB6 version used have been "fixed" to proper unicode and not utf8 this util
will never
be used again.

Regards,
Peter

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Here is my problem:

I have a SQL Server 2000 DB with various NVarChar, NText fields in its
tables.
For some stupid reason the data was inserted into these fields in UTF8
encoding.


That shouldn't make any odds - the encoding shouldn't matter at all, as
what will end up in the database is unicode characters.
However when you retrieve these values into a dataset and ToString() them some characters come out as garbage.

So therefore I have started writing a throw away app that will go through all the
relevant tables and fields decoding and then updating with unicode values.
However I'm a bit confused and stuck. There seem to be lots of classes that sound exactly like what I want however they either don't convert to unicode or they expect the data to be in a byte array. Trying to convert the dataset values from object to a byte array causes a invalid cast exception.

Any ideas or links to detail info about the encoding stuff would be much
appreciated.


See http://www.pobox.com/~skeet/csharp/unicode.html

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #9

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.
Then chances are the data isn't being posted properly to start with.
One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along
and on certain characters causes a problem.
It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then.
Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get
2 or 3 nonsense characters.
What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.

See http://www.pobox.com/~skeet/csharp/d...ngunicode.html for
examples of how to get those character values.
Anyhow after many hours of frustration and trying to understand all the
.NET encoding classes I discovered that it is not possible to do what I want
with native .NET code.
Well, it might be - but it'll be a bit of a hack.
This is because when using a dataset the values have already been implicitly
converted and inaddtion you have to use ToString() any way, by this point
the UTF8 value has been corrupted and the decode to Unicode doesn't work.
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago.
Using a datareader and it's GetBytes() method doesn't help either
because that method only works on Text and NText database types and I
need it to work on NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away
everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that
the VB6 version used have been "fixed" to proper unicode and not utf8 this util
will never be used again.


The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.
If it *is* the problem I think it is, then you might be able to get the
real data using (assuming you've got the text variable from the
database):

byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #10

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.
Then chances are the data isn't being posted properly to start with.
One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along
and on certain characters causes a problem.
It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then.
Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get
2 or 3 nonsense characters.
What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.

See http://www.pobox.com/~skeet/csharp/d...ngunicode.html for
examples of how to get those character values.
Anyhow after many hours of frustration and trying to understand all the
.NET encoding classes I discovered that it is not possible to do what I want
with native .NET code.
Well, it might be - but it'll be a bit of a hack.
This is because when using a dataset the values have already been implicitly
converted and inaddtion you have to use ToString() any way, by this point
the UTF8 value has been corrupted and the decode to Unicode doesn't work.
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago.
Using a datareader and it's GetBytes() method doesn't help either
because that method only works on Text and NText database types and I
need it to work on NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away
everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that
the VB6 version used have been "fixed" to proper unicode and not utf8 this util
will never be used again.


The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.
If it *is* the problem I think it is, then you might be able to get the
real data using (assuming you've got the text variable from the
database):

byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #11

P: n/a
Hi,
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om... Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.
Then chances are the data isn't being posted properly to start with.

The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.
One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along and on certain characters causes a problem.


It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then.

No this is not it. The custom DLL I mention previously is doing it.
VB6 although not as stringent as .NET uses unicode strings.
Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get 2 or 3 nonsense characters.


What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.

The characters were what looked like a backward euro symbol, the TM symbol
and 1 other character I can't remember.

---> snip
This is because when using a dataset the values have already been implicitly converted and inaddtion you have to use ToString() any way, by this point the UTF8 value has been corrupted and the decode to Unicode doesn't work.
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago.
no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs. So VB6 either itself or using the custom
DLL is decoding the value. In addition when these values are output
in a web page it has a meta tag in the <head> section that tells the browser
it should use charset="UTF8", if I use IE options to manual change the
encoding to say western european (iso) then I get the same mangled
characters as seen when output from the .NET version.

Using a datareader and it's GetBytes() method doesn't help either
because that method only works on Text and NText database types and I
need it to work on NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that the VB6 version used have been "fixed" to proper unicode and not utf8 this util will never be used again.


The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.

If it *is* the problem I think it is, then you might be able to get the
real data using (assuming you've got the text variable from the
database):

byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code
above
to pass to the GetBytes() method then I'd get ....

2) A mangled string because I would have to convert object (dataset) to
string
hence mangling the data from the DB again.

So as I said it is not possible to do what I want in native .NET code.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the
first
place.

When I run my little app over my test database and then use the .NET port of
the
software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion
by the VB6 version, it was the .NET version that was doing a conversion
from the source data (which was UTF8) to unicode and getting it wrong.

Any how now all I have to do is run my utility over a copy of one of the
other
databases which has more complicated languages in it like Greek, and if that
works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET MVPs).

Regards,
Peter
Jul 21 '05 #12

P: n/a
Hi,
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om... Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.
Then chances are the data isn't being posted properly to start with.

The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.
One thing to point out here that I didn't mention in the original post
is this is in connection with a port from VB6 to VB.NET.

The original VB6 version did the storing in UTF8, the .NET port comes along and on certain characters causes a problem.


It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then.

No this is not it. The custom DLL I mention previously is doing it.
VB6 although not as stringent as .NET uses unicode strings.
Example an apostrophe stored in the database using UTF8 encoding when
an ADO.NET dataset ToString()'s the value instead of an apostrophe you get 2 or 3 nonsense characters.


What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.

The characters were what looked like a backward euro symbol, the TM symbol
and 1 other character I can't remember.

---> snip
This is because when using a dataset the values have already been implicitly converted and inaddtion you have to use ToString() any way, by this point the UTF8 value has been corrupted and the decode to Unicode doesn't work.
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago.
no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs. So VB6 either itself or using the custom
DLL is decoding the value. In addition when these values are output
in a web page it has a meta tag in the <head> section that tells the browser
it should use charset="UTF8", if I use IE options to manual change the
encoding to say western european (iso) then I get the same mangled
characters as seen when output from the .NET version.

Using a datareader and it's GetBytes() method doesn't help either
because that method only works on Text and NText database types and I
need it to work on NVarChar as well.

But there was light at the end of the tunnel (well in my case any way).
I used COM Interop to use ADO 2.7 and a custom inhouse C++ DLL.
The former to get the data without mangling it behind my back the latter
to decode the UTF8 to Unicode that .NET likes. This saved me throwing away everything I'd already done.

This sounds bad, but this is a one-off use util, once all the databases that the VB6 version used have been "fixed" to proper unicode and not utf8 this util will never be used again.


The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.

If it *is* the problem I think it is, then you might be able to get the
real data using (assuming you've got the text variable from the
database):

byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code
above
to pass to the GetBytes() method then I'd get ....

2) A mangled string because I would have to convert object (dataset) to
string
hence mangling the data from the DB again.

So as I said it is not possible to do what I want in native .NET code.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the
first
place.

When I run my little app over my test database and then use the .NET port of
the
software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion
by the VB6 version, it was the .NET version that was doing a conversion
from the source data (which was UTF8) to unicode and getting it wrong.

Any how now all I have to do is run my utility over a copy of one of the
other
databases which has more complicated languages in it like Greek, and if that
works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET MVPs).

Regards,
Peter
Jul 21 '05 #13

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.


Then chances are the data isn't being posted properly to start with. The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.
I meant posted to the database from the ASP application - sorry, I
wasn't clear.
It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then. No this is not it. The custom DLL I mention previously is doing it.
VB6 although not as stringent as .NET uses unicode strings.
Okay, the DLL, not the VB6 app - but the important thing is what ends
up in the database. The "converting...then" part of the above was the
important bit.
What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.


The characters were what looked like a backward euro symbol, the TM symbol
and 1 other character I can't remember.


Could you follow the instructions in the last link I gave you, to find
out the actual Unicode values?
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago. no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs.
That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.
So VB6 either itself or using the custom
DLL is decoding the value. In addition when these values are output
in a web page it has a meta tag in the <head> section that tells the browser
it should use charset="UTF8", if I use IE options to manual change the
encoding to say western european (iso) then I get the same mangled
characters as seen when output from the .NET version.
I suggest you ignore the browser for the moment - concentrate on
getting the correct values earlier, as discussed in the last link I
gave you.
The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.


Then *that* shouldn't be doing any encoding itself either.
byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code
above to pass to the GetBytes() method then I'd get ....


I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.
I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.
So as I said it is not possible to do what I want in native .NET code.
I still suspect it is, but there we go.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the
first place.

When I run my little app over my test database and then use the .NET port of
the software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion by the VB6 version, it was the .NET version that was doing a conversion
from the source data (which was UTF8) to unicode and getting it wrong.
No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.
Any how now all I have to do is run my utility over a copy of one of the
other databases which has more complicated languages in it like Greek, and if that
works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET MVPs).


If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #14

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Well regardless of "it shouldn't make any odds" it DOES.


Then chances are the data isn't being posted properly to start with. The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.
I meant posted to the database from the ASP application - sorry, I
wasn't clear.
It's just possible that VB6 is converting the string into UTF8, but
then treating each byte in the resulting sequence as a character, and
using the default encoding on the equivalent string. You'd certainly
get bizarre effects then. No this is not it. The custom DLL I mention previously is doing it.
VB6 although not as stringent as .NET uses unicode strings.
Okay, the DLL, not the VB6 app - but the important thing is what ends
up in the database. The "converting...then" part of the above was the
important bit.
What kind of apostrophe, and what kind of nonsense characters? If you
have an example of which Unicode character you're trying to get, and
which Unicode characters you're actually getting back (as integer
values, preferrably) that would be great.


The characters were what looked like a backward euro symbol, the TM symbol
and 1 other character I can't remember.


Could you follow the instructions in the last link I gave you, to find
out the actual Unicode values?
No, I believe it's corrupt by the time it's in the database. It reminds
me of a similar problem someone else posted about a while ago. no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs.
That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.
So VB6 either itself or using the custom
DLL is decoding the value. In addition when these values are output
in a web page it has a meta tag in the <head> section that tells the browser
it should use charset="UTF8", if I use IE options to manual change the
encoding to say western european (iso) then I get the same mangled
characters as seen when output from the .NET version.
I suggest you ignore the browser for the moment - concentrate on
getting the correct values earlier, as discussed in the last link I
gave you.
The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.


Then *that* shouldn't be doing any encoding itself either.
byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code
above to pass to the GetBytes() method then I'd get ....


I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.
I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.
So as I said it is not possible to do what I want in native .NET code.
I still suspect it is, but there we go.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the
first place.

When I run my little app over my test database and then use the .NET port of
the software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion by the VB6 version, it was the .NET version that was doing a conversion
from the source data (which was UTF8) to unicode and getting it wrong.
No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.
Any how now all I have to do is run my utility over a copy of one of the
other databases which has more complicated languages in it like Greek, and if that
works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET MVPs).


If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #15

P: n/a
Hi,

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
> Well regardless of "it shouldn't make any odds" it DOES.

Then chances are the data isn't being posted properly to start with.
The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.


I meant posted to the database from the ASP application - sorry, I
wasn't clear.

With the use of ADO 2.6 naturally.

---> snip
No, I believe it's corrupt by the time it's in the database. It reminds me of a similar problem someone else posted about a while ago.
no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs.


That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.

Not necessarily.
Query Analyser often outputs results and in fact even submits querys
based on your various settings. Which if you were do the same through
ADO would give you an error. i.e. in QA if you have an SP with a datetime
parameter and you enter it MySp @mydate = '2004/04/15' then this should
always work. However because all my settings in OS etc... say british
settings
then I have to but the date in dd/mm/yyyy order. However if Idid this with
ADO the opposite would be true.

--> snip
The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.


Then *that* shouldn't be doing any encoding itself either.
byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code above to pass to the GetBytes() method then I'd get ....


I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.

Yes I was. This is taking a lot to explain.
As I originally said I believe that the database is storing UTF8 encoded
data.
..NET I am now told uses UTF16.
So I got a dataset of the data and tried using virtually the same code as
you
along with a number of other methods to convert from UTF8 to unicode.

However as I explained (fairly straight forwardly I thought). The dataset
has
already done some conversion on the data behind the scenes so by the time my
code trys to say :

......GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do
has no effect. Even if the dataset didn't do this the CType(xxxxx, String)
would
have the same mangling affect.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.


I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.

See my last comment, I can't think of a way to say the same thing yet again
in a different way.
So as I said it is not possible to do what I want in native .NET code.


I still suspect it is, but there we go.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the first place.

When I run my little app over my test database and then use the .NET port of the software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion by the VB6 version, it was the .NET version that was doing a conversion from the source data (which was UTF8) to unicode and getting it wrong.


No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.

The custom C++ DLL (which I didn't write cause I don't know C++) gives me a
UTF8Codec class which has a Decode method on it.
Therefore to me that makes its intentions pretty clear. And since decoding
from
UTF8 to unicode is exactly what I was trying to do to me that suggests its
doing
what .NET won't allow me to.
Any how now all I have to do is run my utility over a copy of one of the
other databases which has more complicated languages in it like Greek, and if that works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET

MVPs).
If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

I tried the util on the DB with the Greek in it and all the garbage
instantly worked, hurrah!

What .NET is doing wrong as I have said previously several times and in
several
different ways is:
1) Datasets implicitly convert the data (messing it up) hence you can't
get to the raw bytes.
2) DataReaders only let you get the raw bytes for text and ntext SQL
Server fields and
I needed it to work on NVarChar as well so that was out.
3) The conversion from UTF8 to unicode via the GetBytes() method
requires a STRING.
In order to get that string I have to CType() to string the database
object value.
By doing this you would (if it weren't for (1) above) get a UNICODE
string but it would
actually be a mangled UTF8 string. Hence negating the UTF8 to
Unicode conversion
code.
Phew!

As a footnote if the browser was getting sent some corrupted text from the
database as you
suggest then how when it displays it assuming charset="UTF8" does it manage
to reverse
the supposed database corruption and display it correctly?

Regards,
Peter
Jul 21 '05 #16

P: n/a
Hi,

"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message news:MP************************@msnews.microsoft.c om...
Peter Row <pe*******@oxfordcc.co.uk> wrote:
> Well regardless of "it shouldn't make any odds" it DOES.

Then chances are the data isn't being posted properly to start with.
The data will be posted from a HTML form with the
enctype="multipart/form-data".
This is because there could be quite a lot of data, including uploaded
files.


I meant posted to the database from the ASP application - sorry, I
wasn't clear.

With the use of ADO 2.6 naturally.

---> snip
No, I believe it's corrupt by the time it's in the database. It reminds me of a similar problem someone else posted about a while ago.
no it's not.
If I use SQL Query analyser to select the data I see the same nonsense
characters as what .NET outputs.


That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.

Not necessarily.
Query Analyser often outputs results and in fact even submits querys
based on your various settings. Which if you were do the same through
ADO would give you an error. i.e. in QA if you have an SP with a datetime
parameter and you enter it MySp @mydate = '2004/04/15' then this should
always work. However because all my settings in OS etc... say british
settings
then I have to but the date in dd/mm/yyyy order. However if Idid this with
ADO the opposite would be true.

--> snip
The VB6 code shouldn't have to do any encoding at all - but I'm not
familiar enough with ADO and VB6 to say exactly what needs to be done.

It doesn't it's the custom C++ DLL thats handling it.


Then *that* shouldn't be doing any encoding itself either.
byte[] databaseValueAsBinary = Encoding.Default.GetBytes(text);
string realText = Encoding.UTF8.GetString(databaseValueAsBinary);

That's worth a try...

I maybe should have posted the .NET code I was using to do the decoding.
I tried virtually identical code to what you have above.

Although you might remember the 2 things I said previously.
1) The dataset has already done some conversion that mangles the data
and even if it hadn't then to get a string value for your 1st line of code above to pass to the GetBytes() method then I'd get ....


I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.

Yes I was. This is taking a lot to explain.
As I originally said I believe that the database is storing UTF8 encoded
data.
..NET I am now told uses UTF16.
So I got a dataset of the data and tried using virtually the same code as
you
along with a number of other methods to convert from UTF8 to unicode.

However as I explained (fairly straight forwardly I thought). The dataset
has
already done some conversion on the data behind the scenes so by the time my
code trys to say :

......GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do
has no effect. Even if the dataset didn't do this the CType(xxxxx, String)
would
have the same mangling affect.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.


I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.

See my last comment, I can't think of a way to say the same thing yet again
in a different way.
So as I said it is not possible to do what I want in native .NET code.


I still suspect it is, but there we go.
Hence I used COM interop to use ADO 2.7 and the custom C++ DLL used by
the original VB6 version of the software that inserted the UTF8 data in the first place.

When I run my little app over my test database and then use the .NET port of the software the apostrophes mention previously now come out correctly.
Thus to me this proves that the data was not corrupt before or during
insertion by the VB6 version, it was the .NET version that was doing a conversion from the source data (which was UTF8) to unicode and getting it wrong.


No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.

The custom C++ DLL (which I didn't write cause I don't know C++) gives me a
UTF8Codec class which has a Decode method on it.
Therefore to me that makes its intentions pretty clear. And since decoding
from
UTF8 to unicode is exactly what I was trying to do to me that suggests its
doing
what .NET won't allow me to.
Any how now all I have to do is run my utility over a copy of one of the
other databases which has more complicated languages in it like Greek, and if that works then my util will have proved itself.

Thanks for taking the time to actually respond. And with some useful
info, makes a change from some other people - (*cough* crap ASP.NET

MVPs).
If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

I tried the util on the DB with the Greek in it and all the garbage
instantly worked, hurrah!

What .NET is doing wrong as I have said previously several times and in
several
different ways is:
1) Datasets implicitly convert the data (messing it up) hence you can't
get to the raw bytes.
2) DataReaders only let you get the raw bytes for text and ntext SQL
Server fields and
I needed it to work on NVarChar as well so that was out.
3) The conversion from UTF8 to unicode via the GetBytes() method
requires a STRING.
In order to get that string I have to CType() to string the database
object value.
By doing this you would (if it weren't for (1) above) get a UNICODE
string but it would
actually be a mangled UTF8 string. Hence negating the UTF8 to
Unicode conversion
code.
Phew!

As a footnote if the browser was getting sent some corrupted text from the
database as you
suggest then how when it displays it assuming charset="UTF8" does it manage
to reverse
the supposed database corruption and display it correctly?

Regards,
Peter
Jul 21 '05 #17

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:

<snip>
That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.
Not necessarily.
Query Analyser often outputs results and in fact even submits querys
based on your various settings. Which if you were do the same through
ADO would give you an error. i.e. in QA if you have an SP with a datetime
parameter and you enter it MySp @mydate = '2004/04/15' then this should
always work. However because all my settings in OS etc... say british
settings
then I have to but the date in dd/mm/yyyy order. However if Idid this with
ADO the opposite would be true.


Fair enough - but that wouldn't have the effect of corrupting
nvarchar/ntext strings.
I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.

Yes I was. This is taking a lot to explain.
As I originally said I believe that the database is storing UTF8 encoded
data.


The database is storing Unicode strings. How it stores them is
irrelevant (trust me - go along with it for now, hopefully it'll make
sense in a minute).
.NET I am now told uses UTF16.
It does internally - but fundamentally the point is that they're
Unicode strings. They're sequences of characters, not bytes.
So I got a dataset of the data and tried using virtually the same code as
you along with a number of other methods to convert from UTF8 to unicode.

However as I explained (fairly straight forwardly I thought). The dataset
has already done some conversion on the data behind the scenes so by the time my
code trys to say :

.....GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do
has no effect. Even if the dataset didn't do this the CType(xxxxx, String)
would have the same mangling affect.
No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
..NET.

The strings you are seeing in .NET are the strings which are in the
database.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.


I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.

See my last comment, I can't think of a way to say the same thing yet again
in a different way.


That's because I believe you have an incorrect view of what's going on
:)
So as I said it is not possible to do what I want in native .NET code.

No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.

The custom C++ DLL (which I didn't write cause I don't know C++) gives me a
UTF8Codec class which has a Decode method on it.
Therefore to me that makes its intentions pretty clear. And since decoding
from UTF8 to unicode is exactly what I was trying to do to me that suggests its
doing what .NET won't allow me to.


No, it's doing what .NET is discouraging you from by having a clear
separation of bytes from chars in the first place. That's not to say
it's impossible in .NET though.
If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

I tried the util on the DB with the Greek in it and all the garbage
instantly worked, hurrah!


Good. I still think it would be wise to understand what's been going on
though.
What .NET is doing wrong as I have said previously several times and in
several different ways is:
You've said it in different ways, but I believe you haven't listened to
what I've been suggesting has actually happened
1) Datasets implicitly convert the data (messing it up) hence you can't
get to the raw bytes.
There *are* no raw bytes - none that you should be bothered with,
anyway. The database logically stores *characters*, not bytes, in
ntext/nvarchar fields.
2) DataReaders only let you get the raw bytes for text and ntext SQL
Server fields and
I needed it to work on NVarChar as well so that was out.
See above.
3) The conversion from UTF8 to unicode via the GetBytes() method
requires a STRING.
In order to get that string I have to CType() to string the database
object value.
By doing this you would (if it weren't for (1) above) get a UNICODE
string but it would
actually be a mangled UTF8 string. Hence negating the UTF8 to
Unicode conversion
code.
Phew!
Please reread the article I linked to before - I think you still have
some conceptual problems in distinguishing between an encoded string
(which is a sequence of bytes) and a string which is a sequence of
characters.
As a footnote if the browser was getting sent some corrupted text from the
database as you suggest then how when it displays it assuming charset="UTF8"
does it manage to reverse the supposed database corruption and display it
correctly?


By looking at the data it's been given, working out that it probably
isn't what was intended, and making a heuristic guess at what's
happened.
Here's one more try at saying what I believe has happened to the
database:

1) Your VB6 app receives a string
2) The C++ DLL encodes that string into a sequence of UTF-8 bytes
3) The ADO layer then thinks that sequence of UTF-8 bytes is *actually*
a sequence of bytes in the default encoding
4) You end up with a sequence of characters in the database, each of
which is the character in the default encoding for a single byte of the
UTF-8 representation of the string

Reading this from anything which *correctly* gets the string from the
database, you end up with too many characters, some of which look
strange.

However, the C++ DLL will basically reverse its operation - so when
fetching the string from ADO, it ends up as a sequence of bytes in the
default encoding, which the C++ DLL then treats as a sequence of bytes
of UTF-8, and decodes back to Unicode characters, giving you back what
you thought of.

So essentially, the C++ DLL is applying an incorrect transformation -
but it's applying it in a reversible way, so that you end up with
apparently garbage data in the database, but data which can be
"sensibly" retrieved with the C++ DLL. That doesn't mean that .NET has
done *anything* wrong.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #18

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:

<snip>
That suggests it's exactly what I thought then - if query analyser is
showing you "nonsense characters" then it's corrupt in the database.
Not necessarily.
Query Analyser often outputs results and in fact even submits querys
based on your various settings. Which if you were do the same through
ADO would give you an error. i.e. in QA if you have an SP with a datetime
parameter and you enter it MySp @mydate = '2004/04/15' then this should
always work. However because all my settings in OS etc... say british
settings
then I have to but the date in dd/mm/yyyy order. However if Idid this with
ADO the opposite would be true.


Fair enough - but that wouldn't have the effect of corrupting
nvarchar/ntext strings.
I thought you'd said you were *trying* to do some conversion, but that
it hadn't worked. I assumed you'd gone back to the original (corrupt)
data.

Yes I was. This is taking a lot to explain.
As I originally said I believe that the database is storing UTF8 encoded
data.


The database is storing Unicode strings. How it stores them is
irrelevant (trust me - go along with it for now, hopefully it'll make
sense in a minute).
.NET I am now told uses UTF16.
It does internally - but fundamentally the point is that they're
Unicode strings. They're sequences of characters, not bytes.
So I got a dataset of the data and tried using virtually the same code as
you along with a number of other methods to convert from UTF8 to unicode.

However as I explained (fairly straight forwardly I thought). The dataset
has already done some conversion on the data behind the scenes so by the time my
code trys to say :

.....GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do
has no effect. Even if the dataset didn't do this the CType(xxxxx, String)
would have the same mangling affect.
No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
..NET.

The strings you are seeing in .NET are the strings which are in the
database.
2) A mangled string because I would have to convert object (dataset) to
string hence mangling the data from the DB again.


I think you misunderstand what I believe has happened. My code would
"unmangle" the string (so realText would end up being the correct
string) if my belief as to what has happened is correct.

See my last comment, I can't think of a way to say the same thing yet again
in a different way.


That's because I believe you have an incorrect view of what's going on
:)
So as I said it is not possible to do what I want in native .NET code.

No, it doesn't prove that *at all*. It proves that your custom C++ DLL
can read its incorrectly written values, that's all. The fact that SQL
Query Analyser shows the strange characters shows that it *is* corrupt
in the database.

The custom C++ DLL (which I didn't write cause I don't know C++) gives me a
UTF8Codec class which has a Decode method on it.
Therefore to me that makes its intentions pretty clear. And since decoding
from UTF8 to unicode is exactly what I was trying to do to me that suggests its
doing what .NET won't allow me to.


No, it's doing what .NET is discouraging you from by having a clear
separation of bytes from chars in the first place. That's not to say
it's impossible in .NET though.
If your util works, that's great - but I still maintain that it sounds
like it's nothing that .NET is doing wrong at all, it's just corrupt
data in the database, and that with a bit of work you could still fix
it from .NET (with the code shown above).

I tried the util on the DB with the Greek in it and all the garbage
instantly worked, hurrah!


Good. I still think it would be wise to understand what's been going on
though.
What .NET is doing wrong as I have said previously several times and in
several different ways is:
You've said it in different ways, but I believe you haven't listened to
what I've been suggesting has actually happened
1) Datasets implicitly convert the data (messing it up) hence you can't
get to the raw bytes.
There *are* no raw bytes - none that you should be bothered with,
anyway. The database logically stores *characters*, not bytes, in
ntext/nvarchar fields.
2) DataReaders only let you get the raw bytes for text and ntext SQL
Server fields and
I needed it to work on NVarChar as well so that was out.
See above.
3) The conversion from UTF8 to unicode via the GetBytes() method
requires a STRING.
In order to get that string I have to CType() to string the database
object value.
By doing this you would (if it weren't for (1) above) get a UNICODE
string but it would
actually be a mangled UTF8 string. Hence negating the UTF8 to
Unicode conversion
code.
Phew!
Please reread the article I linked to before - I think you still have
some conceptual problems in distinguishing between an encoded string
(which is a sequence of bytes) and a string which is a sequence of
characters.
As a footnote if the browser was getting sent some corrupted text from the
database as you suggest then how when it displays it assuming charset="UTF8"
does it manage to reverse the supposed database corruption and display it
correctly?


By looking at the data it's been given, working out that it probably
isn't what was intended, and making a heuristic guess at what's
happened.
Here's one more try at saying what I believe has happened to the
database:

1) Your VB6 app receives a string
2) The C++ DLL encodes that string into a sequence of UTF-8 bytes
3) The ADO layer then thinks that sequence of UTF-8 bytes is *actually*
a sequence of bytes in the default encoding
4) You end up with a sequence of characters in the database, each of
which is the character in the default encoding for a single byte of the
UTF-8 representation of the string

Reading this from anything which *correctly* gets the string from the
database, you end up with too many characters, some of which look
strange.

However, the C++ DLL will basically reverse its operation - so when
fetching the string from ADO, it ends up as a sequence of bytes in the
default encoding, which the C++ DLL then treats as a sequence of bytes
of UTF-8, and decodes back to Unicode characters, giving you back what
you thought of.

So essentially, the C++ DLL is applying an incorrect transformation -
but it's applying it in a reversible way, so that you end up with
apparently garbage data in the database, but data which can be
"sensibly" retrieved with the C++ DLL. That doesn't mean that .NET has
done *anything* wrong.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #19

P: n/a
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:

<snip>

One more thing: being a C# type person, I don't know the details of
what CType does under the covers. I suggest you stick to
Encoding.GetBytes and Encoding.GetString - that'll make it clearer in
our discussions exactly what's going on, for starters.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #20

P: n/a
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:

<snip>

One more thing: being a C# type person, I don't know the details of
what CType does under the covers. I suggest you stick to
Encoding.GetBytes and Encoding.GetString - that'll make it clearer in
our discussions exactly what's going on, for starters.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #21

P: n/a
Hi Jon,

You did make a lot of work from it even more than from a discussion with me.

This is a compliment of course.

Cor
Jul 21 '05 #22

P: n/a
Hi Jon,

You did make a lot of work from it even more than from a discussion with me.

This is a compliment of course.

Cor
Jul 21 '05 #23

P: n/a
Hi,

Errmm, you do know that there is a difference between an Object and a String
data type right?

Well a dataset returns all its values as Objects.
The Encoding.GetBytes() expects a Char array or a string as parameter
therefore you CANNOT pass in the dataset field value directly you have
to convert the dataset object to either a String or a Char array. If you
don't
then in VB.NET you get an "option strict on you must convert" warning.

The dataset field value (in my case) if you use the GetType method
returns String, therefore you have no choice but to convert from
Object to String.

If you don't use CType, then what do you use to convert a value
from an Object to a String?

Regards,
Peter
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:

<snip>

One more thing: being a C# type person, I don't know the details of
what CType does under the covers. I suggest you stick to
Encoding.GetBytes and Encoding.GetString - that'll make it clearer in
our discussions exactly what's going on, for starters.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #24

P: n/a
Hi,

Errmm, you do know that there is a difference between an Object and a String
data type right?

Well a dataset returns all its values as Objects.
The Encoding.GetBytes() expects a Char array or a string as parameter
therefore you CANNOT pass in the dataset field value directly you have
to convert the dataset object to either a String or a Char array. If you
don't
then in VB.NET you get an "option strict on you must convert" warning.

The dataset field value (in my case) if you use the GetType method
returns String, therefore you have no choice but to convert from
Object to String.

If you don't use CType, then what do you use to convert a value
from an Object to a String?

Regards,
Peter
"Jon Skeet [C# MVP]" <sk***@pobox.com> wrote in message
news:MP************************@msnews.microsoft.c om...
Jon Skeet [C# MVP] <sk***@pobox.com> wrote:

<snip>

One more thing: being a C# type person, I don't know the details of
what CType does under the covers. I suggest you stick to
Encoding.GetBytes and Encoding.GetString - that'll make it clearer in
our discussions exactly what's going on, for starters.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Jul 21 '05 #25

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Errmm, you do know that there is a difference between an Object and a String
data type right?
Yes.
Well a dataset returns all its values as Objects.
The Encoding.GetBytes() expects a Char array or a string as parameter
therefore you CANNOT pass in the dataset field value directly you have
to convert the dataset object to either a String or a Char array. If you
don't then in VB.NET you get an "option strict on you must convert" warning.


Ah, right. So the CType here is *just* doing a cast, basically. It's
not actually converting any data whatsoever. That's fine then - and is
fortunately unrelated to anything else in the problem.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #26

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
Errmm, you do know that there is a difference between an Object and a String
data type right?
Yes.
Well a dataset returns all its values as Objects.
The Encoding.GetBytes() expects a Char array or a string as parameter
therefore you CANNOT pass in the dataset field value directly you have
to convert the dataset object to either a String or a Char array. If you
don't then in VB.NET you get an "option strict on you must convert" warning.


Ah, right. So the CType here is *just* doing a cast, basically. It's
not actually converting any data whatsoever. That's fine then - and is
fortunately unrelated to anything else in the problem.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #27

P: n/a
Hi,

--> snip
.NET I am now told uses UTF16.


It does internally - but fundamentally the point is that they're
Unicode strings. They're sequences of characters, not bytes.
So I got a dataset of the data and tried using virtually the same code as you along with a number of other methods to convert from UTF8 to unicode.
However as I explained (fairly straight forwardly I thought). The dataset has already done some conversion on the data behind the scenes so by the time my code trys to say :

.....GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do has no effect. Even if the dataset didn't do this the CType(xxxxx, String) would have the same mangling affect.


No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
.NET.

If you don't know what the dataset is doing how can you say that any
conversion it may be doing won't have any effect?
--> snip
I will look at the link again, but I haven't got the time to bash my head
against
a brick wall trying to work it out so that it works in .NET. And since my
util
works on both the easiest (mono-lingual English) database and the hardest
(multi-lingual English/Greek) then I can't really justify spending more time
on it.

One thing that occurs to me is that assuming the C++ DLL has applied
some inappropriate conversion that it reverses then I fail to see how .NET
can undo that UNLESS it uses the same logical routine (except written in
VB.NET) to do the same reversal. Because all .NETs encoding classes
assume that the string is encoded correctly. But from what you've said
that is probably not the case.

Any how thanks again for all the effort you've put in.
At present its a case of get the once only task done as quickly as possible.
Whilst understanding unicode would undoubtedly helped, in the end I
got it done without having to know much at all. So until I get put onto a
project that actively fiddles with all this sort of stuff it is unlikely I
will get
the time to fully get to grips with this.

Regards,
Peter
Jul 21 '05 #28

P: n/a
Hi,

--> snip
.NET I am now told uses UTF16.


It does internally - but fundamentally the point is that they're
Unicode strings. They're sequences of characters, not bytes.
So I got a dataset of the data and tried using virtually the same code as you along with a number of other methods to convert from UTF8 to unicode.
However as I explained (fairly straight forwardly I thought). The dataset has already done some conversion on the data behind the scenes so by the time my code trys to say :

.....GetBytes(CType(ds.Tables(0).Rows(i).Item(0), String))

The value has already been mangled and hence the decoding I am trying to do has no effect. Even if the dataset didn't do this the CType(xxxxx, String) would have the same mangling affect.


No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
.NET.

If you don't know what the dataset is doing how can you say that any
conversion it may be doing won't have any effect?
--> snip
I will look at the link again, but I haven't got the time to bash my head
against
a brick wall trying to work it out so that it works in .NET. And since my
util
works on both the easiest (mono-lingual English) database and the hardest
(multi-lingual English/Greek) then I can't really justify spending more time
on it.

One thing that occurs to me is that assuming the C++ DLL has applied
some inappropriate conversion that it reverses then I fail to see how .NET
can undo that UNLESS it uses the same logical routine (except written in
VB.NET) to do the same reversal. Because all .NETs encoding classes
assume that the string is encoded correctly. But from what you've said
that is probably not the case.

Any how thanks again for all the effort you've put in.
At present its a case of get the once only task done as quickly as possible.
Whilst understanding unicode would undoubtedly helped, in the end I
got it done without having to know much at all. So until I get put onto a
project that actively fiddles with all this sort of stuff it is unlikely I
will get
the time to fully get to grips with this.

Regards,
Peter
Jul 21 '05 #29

P: n/a
Hi Peter,

Something I think on, when you are dealing with English only you have to do
with a relative small characterset.
A subset from the 437 set, it should not be impossible to put that in a
string array and than only to make a compare routine checking every
character yourself.

It looks a timespending procedure however mostly it is done in less time
than Jon took for you to help you.

Just an idea.

Cor
Jul 21 '05 #30

P: n/a
Hi Peter,

Something I think on, when you are dealing with English only you have to do
with a relative small characterset.
A subset from the 437 set, it should not be impossible to put that in a
string array and than only to make a compare routine checking every
character yourself.

It looks a timespending procedure however mostly it is done in less time
than Jon took for you to help you.

Just an idea.

Cor
Jul 21 '05 #31

P: n/a
And in additon, when you do that be sure to use the indexOf with a real char
and not with a string

Cor
Jul 21 '05 #32

P: n/a
And in additon, when you do that be sure to use the indexOf with a real char
and not with a string

Cor
Jul 21 '05 #33

P: n/a
Hi,

Sorry I can't make much sense of what you've written.

Regards,
Peter
"Cor Ligthert" <no**********@planet.nl> wrote in message
news:u8**************@tk2msftngp13.phx.gbl...
Hi Peter,

Something I think on, when you are dealing with English only you have to do with a relative small characterset.
A subset from the 437 set, it should not be impossible to put that in a
string array and than only to make a compare routine checking every
character yourself.

It looks a timespending procedure however mostly it is done in less time
than Jon took for you to help you.

Just an idea.

Cor

Jul 21 '05 #34

P: n/a
Hi,

Sorry I can't make much sense of what you've written.

Regards,
Peter
"Cor Ligthert" <no**********@planet.nl> wrote in message
news:u8**************@tk2msftngp13.phx.gbl...
Hi Peter,

Something I think on, when you are dealing with English only you have to do with a relative small characterset.
A subset from the 437 set, it should not be impossible to put that in a
string array and than only to make a compare routine checking every
character yourself.

It looks a timespending procedure however mostly it is done in less time
than Jon took for you to help you.

Just an idea.

Cor

Jul 21 '05 #35

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
.NET.
If you don't know what the dataset is doing how can you say that any
conversion it may be doing won't have any effect?


Because there's fundamentally no need to - it's a Unicode string (in
terms of character *set*, not character *encoding*) on both sides.
--> snip
I will look at the link again, but I haven't got the time to bash my head
against
a brick wall trying to work it out so that it works in .NET. And since my
util works on both the easiest (mono-lingual English) database and the hardest
(multi-lingual English/Greek) then I can't really justify spending more time
on it.
Fair enough.
One thing that occurs to me is that assuming the C++ DLL has applied
some inappropriate conversion that it reverses then I fail to see how .NET
can undo that UNLESS it uses the same logical routine (except written in
VB.NET) to do the same reversal. Because all .NETs encoding classes
assume that the string is encoded correctly. But from what you've said
that is probably not the case.
The code I gave you basically said, "Give me this string as a byte
array in the default character set, and then decode the result as if it
were the result of a UTF-8 encoding" - which would reverse what I
believe the C++ DLL has done.
Any how thanks again for all the effort you've put in.
No problem.
At present its a case of get the once only task done as quickly as possible.
Whilst understanding unicode would undoubtedly helped, in the end I
got it done without having to know much at all. So until I get put onto a
project that actively fiddles with all this sort of stuff it is unlikely I
will get the time to fully get to grips with this.


Sure - just hold off on the judgement that it's all .NET's fault for
the moment. When given the choice of "both SQL Query Analyser and .NET
are broken" against "the C++ DLL is misbehaving" I know which I think
is more likely :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #36

P: n/a
Peter Row <pe*******@oxfordcc.co.uk> wrote:
No. The dataset *hasn't* done any conversion, I believe - or at least,
none that should have any effect. It's presented the Unicode sequence
of characters in the database as a Unicode sequence of characters in
.NET.
If you don't know what the dataset is doing how can you say that any
conversion it may be doing won't have any effect?


Because there's fundamentally no need to - it's a Unicode string (in
terms of character *set*, not character *encoding*) on both sides.
--> snip
I will look at the link again, but I haven't got the time to bash my head
against
a brick wall trying to work it out so that it works in .NET. And since my
util works on both the easiest (mono-lingual English) database and the hardest
(multi-lingual English/Greek) then I can't really justify spending more time
on it.
Fair enough.
One thing that occurs to me is that assuming the C++ DLL has applied
some inappropriate conversion that it reverses then I fail to see how .NET
can undo that UNLESS it uses the same logical routine (except written in
VB.NET) to do the same reversal. Because all .NETs encoding classes
assume that the string is encoded correctly. But from what you've said
that is probably not the case.
The code I gave you basically said, "Give me this string as a byte
array in the default character set, and then decode the result as if it
were the result of a UTF-8 encoding" - which would reverse what I
believe the C++ DLL has done.
Any how thanks again for all the effort you've put in.
No problem.
At present its a case of get the once only task done as quickly as possible.
Whilst understanding unicode would undoubtedly helped, in the end I
got it done without having to know much at all. So until I get put onto a
project that actively fiddles with all this sort of stuff it is unlikely I
will get the time to fully get to grips with this.


Sure - just hold off on the judgement that it's all .NET's fault for
the moment. When given the choice of "both SQL Query Analyser and .NET
are broken" against "the C++ DLL is misbehaving" I know which I think
is more likely :)

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Jul 21 '05 #37

P: n/a
I was writting this and I think it can even much simpler than I thought
before, however see this as pseudo.

I type it in all in here so nothing is checked. It assumes that you only use
standard extended ASCII characters and no things like the Eurosign or
whatever which is not in that.

However I wrote it in here so it is just pseudo and as an idea.
(And when it would work I think it goes fast enough this is all processor
and memory).

dim y as integer
dim n as boolean
for a as integer = 0 to ds.tables.count - 1
for b as integer = 0 to ds.tables(a).rows.count-1
for c as integer = 0 to ds.tables(a).rows(b).items.count - 1
for i as integer = 0 to ds.tables(0).rows(b)(c).length - 1
for y = 30 to 255
n = true
if ds.tables(a).rows(b)(c) .substring(i,1) = chrw(y) then
n = false
exit for
end if
next
if n then
ds.tables(a).rows(b)(c).substring(i,1) = " "
next
next
next
next
update dataset

Cor
Jul 21 '05 #38

P: n/a
I was writting this and I think it can even much simpler than I thought
before, however see this as pseudo.

I type it in all in here so nothing is checked. It assumes that you only use
standard extended ASCII characters and no things like the Eurosign or
whatever which is not in that.

However I wrote it in here so it is just pseudo and as an idea.
(And when it would work I think it goes fast enough this is all processor
and memory).

dim y as integer
dim n as boolean
for a as integer = 0 to ds.tables.count - 1
for b as integer = 0 to ds.tables(a).rows.count-1
for c as integer = 0 to ds.tables(a).rows(b).items.count - 1
for i as integer = 0 to ds.tables(0).rows(b)(c).length - 1
for y = 30 to 255
n = true
if ds.tables(a).rows(b)(c) .substring(i,1) = chrw(y) then
n = false
exit for
end if
next
if n then
ds.tables(a).rows(b)(c).substring(i,1) = " "
next
next
next
next
update dataset

Cor
Jul 21 '05 #39

P: n/a
Hi,

Not at all sure what the whole point of you message is, it doesn't have
much (if any) relevance to the original post.

You pseudo code seems to try to detect extended ASCII characters in
dataset columns and replace them with a space.

If you read the original post I was trying to convert what I thought
were UTF8 encoded strings from a dataset into.NET unicode (UTF16)
and failing miserable (although did it in the end, but not using .NET
components).

Regards,
Peter
"Cor Ligthert" <no**********@planet.nl> wrote in message
news:%2****************@TK2MSFTNGP09.phx.gbl...
I was writting this and I think it can even much simpler than I thought
before, however see this as pseudo.

I type it in all in here so nothing is checked. It assumes that you only use standard extended ASCII characters and no things like the Eurosign or
whatever which is not in that.

However I wrote it in here so it is just pseudo and as an idea.
(And when it would work I think it goes fast enough this is all processor
and memory).

dim y as integer
dim n as boolean
for a as integer = 0 to ds.tables.count - 1
for b as integer = 0 to ds.tables(a).rows.count-1
for c as integer = 0 to ds.tables(a).rows(b).items.count - 1
for i as integer = 0 to ds.tables(0).rows(b)(c).length - 1
for y = 30 to 255
n = true
if ds.tables(a).rows(b)(c) .substring(i,1) = chrw(y) then
n = false
exit for
end if
next
if n then
ds.tables(a).rows(b)(c).substring(i,1) = " "
next
next
next
next
update dataset

Cor

Jul 21 '05 #40

P: n/a
Hi,

Not at all sure what the whole point of you message is, it doesn't have
much (if any) relevance to the original post.

You pseudo code seems to try to detect extended ASCII characters in
dataset columns and replace them with a space.

If you read the original post I was trying to convert what I thought
were UTF8 encoded strings from a dataset into.NET unicode (UTF16)
and failing miserable (although did it in the end, but not using .NET
components).

Regards,
Peter
"Cor Ligthert" <no**********@planet.nl> wrote in message
news:%2****************@TK2MSFTNGP09.phx.gbl...
I was writting this and I think it can even much simpler than I thought
before, however see this as pseudo.

I type it in all in here so nothing is checked. It assumes that you only use standard extended ASCII characters and no things like the Eurosign or
whatever which is not in that.

However I wrote it in here so it is just pseudo and as an idea.
(And when it would work I think it goes fast enough this is all processor
and memory).

dim y as integer
dim n as boolean
for a as integer = 0 to ds.tables.count - 1
for b as integer = 0 to ds.tables(a).rows.count-1
for c as integer = 0 to ds.tables(a).rows(b).items.count - 1
for i as integer = 0 to ds.tables(0).rows(b)(c).length - 1
for y = 30 to 255
n = true
if ds.tables(a).rows(b)(c) .substring(i,1) = chrw(y) then
n = false
exit for
end if
next
if n then
ds.tables(a).rows(b)(c).substring(i,1) = " "
next
next
next
next
update dataset

Cor

Jul 21 '05 #41

This discussion thread is closed

Replies have been disabled for this discussion.