Encoding/Codepage: Can't Get There From Here

Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encoding, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-6 codepage system
wide. As much as it sounds good, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.
Thus far, I haven't found anything that leads me to believe this is
possible in .NET or that that specific codepage is supported without
coding a custom Encoding class to do the conversion.

The HP Roman-6 codepage is available on the net, so it should be a
matter of mapping the two codepages I would think.

Given the situation, what's the best way to tackle this problem?

-=Chris
Nov 22 '05 #1
Christopher H. Laco <me********@gmail.com> wrote:
Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encoding, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-6 codepage system
wide. As much as it sounds good, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.

I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-6, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.

Thus far, I haven't found anything that leads me to believe this is
possible in .NET or that that specific codepage is supported without
coding a custom Encoding class to do the conversion.

The HP Roman-6 codepage is available on the net, so it should be a
matter of mapping the two codepages I would think.

Given the situation, what's the best way to tackle this problem?


Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 22 '05 #2
Jon Skeet [C# MVP] wrote:
What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.

I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-6, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.


That's pretty much what happens. It's not really stored anywhere
session-wise. I'm just trying to convert it to something the backend can
handle right before I write it to the socket.

Encoding.Default was my first try. I need to do some more digging. I'm
not sure what codepage .NET thinks it is when I get it from
IIS/ASP->COM->Assembly.


Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil


Yeah, that's what I was looking at yesterday. :-)

-=Chris
Nov 22 '05 #3
Christopher H. Laco wrote:
Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encoding, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-6 codepage system
wide. As much as it sounds good, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-6.
Thus far, I haven't found anything that leads me to believe this is
possible in .NET or that that specific codepage is supported without
coding a custom Encoding class to do the conversion.

The HP Roman-6 codepage is available on the net, so it should be a
matter of mapping the two codepages I would think.

Given the situation, what's the best way to tackle this problem?

-=Chris


To be honest, I still don't know where to start. I'm still a little
confused on how converting from one codepage to another actually happens.

Converting from latin to HP Roman8 is easy since I have the HP Roman 6
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've never had to think of such things
most of the time.

I can just hack a quick latin to hp roman byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage. It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding('mycustom').

-=Chris
Nov 22 '05 #4
Christopher H. Laco <me********@gmail.com> wrote:
To be honest, I still don't know where to start. I'm still a little
confused on how converting from one codepage to another actually happens.

You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

Converting from latin to HP Roman8 is easy since I have the HP Roman 6
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've never had to think of such things
most of the time.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

I can just hack a quick latin to hp roman byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.
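
To illustrate the point that a .NET string carries no trace of the bytes it was decoded from, here is a minimal sketch. The class and method names (StringIsUnicode, CodeUnits) are made up for this example; the two byte arrays are the same text, "LÁ", in ISO-8859-1 and in UTF-8.

```csharp
using System;
using System.Text;

public static class StringIsUnicode
{
    // Returns the UTF-16 code units of a string as hex, e.g. "004C 00C1".
    public static string CodeUnits(string text)
    {
        var parts = new string[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            parts[i] = ((int)text[i]).ToString("X4");
        }
        return string.Join(" ", parts);
    }

    public static void Main()
    {
        byte[] fromLatin1 = { 0x4C, 0xC1 };       // "LÁ" as ISO-8859-1 bytes
        byte[] fromUtf8 = { 0x4C, 0xC3, 0x81 };   // "LÁ" as UTF-8 bytes

        string a = Encoding.GetEncoding("iso-8859-1").GetString(fromLatin1);
        string b = Encoding.UTF8.GetString(fromUtf8);

        Console.WriteLine(CodeUnits(a)); // 004C 00C1
        Console.WriteLine(CodeUnits(b)); // 004C 00C1
        Console.WriteLine(a == b);       // True
    }
}
```

Once decoded, both strings are identical, so the only conversion left to write is string to target bytes.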

It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding('mycustom').


Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 22 '05 #5
Jon Skeet [C# MVP] wrote:
Christopher H. Laco <me********@gmail.com> wrote:
To be honest, I still don't know where to start. I'm still a little
confused on how converting from one codepage to another actually happens.

You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

Converting from latin to HP Roman8 is easy since I have the HP Roman 6
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've never had to think of such things
most of the time.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

I can just hack a quick latin to hp roman byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.

It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding('mycustom').

Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...


I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET where I have it in a string. So, yes, source
is irrelevant in this case. But for the sake of learning, I'd like to
understand how conversions between two codepages work in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:
string data = "LÁCÔ";
.NET stores it internally as Unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);


The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET Unicode string variable data into the
default ANSI 1252 on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.
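
For reference, a small sketch of what the two built-in routes put on the wire for this string. ISO-8859-1 stands in for the Windows default codepage here (an assumption made so the output does not depend on the machine's settings), and the names EncodingDemo and ToHex are invented for the example.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Encodes text with the given encoding and returns the bytes as hex, e.g. "4C-3F".
    public static string ToHex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    public static void Main()
    {
        // "LÁCÔ", written with escapes so the source file's own encoding doesn't matter.
        string data = "L\u00C1C\u00D4";

        // ASCII has no Á or Ô, so each one is replaced with '?' (0x3F).
        Console.WriteLine(ToHex(data, Encoding.ASCII));                     // 4C-3F-43-3F

        // ISO-8859-1 keeps them as single bytes equal to their Unicode code points.
        Console.WriteLine(ToHex(data, Encoding.GetEncoding("iso-8859-1"))); // 4C-C1-43-D4
    }
}
```

Neither byte sequence is what a backend using a different 8-bit codepage expects for Á and Ô, which is why a dedicated encoding is needed.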

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode to do the
conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back. No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

I just don't know where to go from here.

Thanks for the help!
-=Chris
Nov 22 '05 #6
Christopher H. Laco <me********@gmail.com> wrote:
I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET where I have it in a string. So, yes, source
is irrelevant in this case. But for the sake of learning, I'd like to
understand how conversions between two codepages work in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:
string data = "LÁCÔ";
.NET stores it internally as unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);


No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET Unicode string variable data into the
default ANSI 1252 on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Fortunately, that's quite easy - and it's all you need to do.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode to do the
conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.
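
The two steps above can be sketched roughly as follows. This is not code from the thread: RomanMapper is a hypothetical name, the table holds only three entries, and the byte values are assumed from a published HP Roman-8 chart, so check them against the chart for the codepage the backend really uses. It also assumes the 7-bit range is plain ASCII.

```csharp
using System;
using System.Collections.Generic;

public static class RomanMapper
{
    // Hypothetical, partial table: Unicode char -> target byte.
    // Values are assumptions taken from an HP Roman-8 chart; verify before use.
    private static readonly Dictionary<char, byte> HighHalf = new Dictionary<char, byte>
    {
        { '\u00C1', 0xE0 }, // Á
        { '\u00D4', 0xDF }, // Ô
        { '\u00E9', 0xC5 }, // é
    };

    public static byte[] Encode(string text)
    {
        var result = new byte[text.Length];
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            byte mapped;
            if (c < 0x80)
            {
                result[i] = (byte)c;       // assumed: low half is identical to ASCII
            }
            else if (HighHalf.TryGetValue(c, out mapped))
            {
                result[i] = mapped;        // step 1: character exists in the target set
            }
            else
            {
                result[i] = (byte)'?';     // step 2: no equivalent, fall back to '?'
            }
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(BitConverter.ToString(Encode("L\u00C1C\u00D4"))); // 4C-E0-43-DF
        Console.WriteLine(BitConverter.ToString(Encode("\u20AC")));         // 3F
    }
}
```

A dictionary keeps the sketch short; a full implementation would more likely use a 256-entry array covering the whole chart.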

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back.

It may not be sending ISO-8859-1 - there's no necessity for a browser
to make a request with the same encoding as the last page it looked at.

No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

No - it uses whatever the browser sends. It only has to guess if the
browser doesn't say what encoding to use.

I just don't know where to go from here.


Well, which part of deriving from Encoding are you having trouble with?
As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

It's got a few optimisations in there which you'll need to understand
in order to read the code, but you probably won't need to do
equivalents yourself.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 22 '05 #7
Jon Skeet [C# MVP] wrote:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.

I just don't yet understand the lines between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Fortunately, that's quite easy - and it's all you need to do.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode to do the
conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.


It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)

I just don't know where to go from here.

Well, which part of deriving from Encoding are you having trouble with?


Oh, that part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)

As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/


To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?

Time to submit a feature request to .NET 2.5: pluggable codepages so
this isn't necessary. :-)

Thanks,
-=Chris
Nov 22 '05 #8
Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

I was looking at a more literal (but less flexible) approach like the
other Encoding. stuff (UTF8/ASCII).

-=Chris
Nov 22 '05 #9
Christopher H. Laco <me********@gmail.com> wrote:
Jon Skeet [C# MVP] wrote:
Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);
No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.


And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.


Indeed - so as I've been saying, you need to write your own encoding.
You don't need GetEncoding.

I just don't yet understand the lines between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Encoding.Convert basically calls GetString or GetChars using the source
encoding to convert to Unicode, then GetBytes using the second encoding
to convert to bytes again.
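
A small sketch of that description, using two built-in encodings so it runs without any custom class. TwoStep is a hypothetical helper written for the comparison, not a framework method.

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    // Hypothetical helper: the decode-then-encode route described above.
    public static byte[] TwoStep(Encoding source, Encoding target, byte[] bytes)
    {
        string text = source.GetString(bytes); // bytes -> Unicode
        return target.GetBytes(text);          // Unicode -> bytes
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] input = { 0x4C, 0xC1, 0x43, 0xD4 }; // "LÁCÔ" in ISO-8859-1

        byte[] direct = Encoding.Convert(latin1, Encoding.UTF8, input);
        byte[] manual = TwoStep(latin1, Encoding.UTF8, input);

        Console.WriteLine(BitConverter.ToString(direct)); // 4C-C3-81-43-C3-94
        Console.WriteLine(BitConverter.ToString(manual)); // 4C-C3-81-43-C3-94
    }
}
```

Either way the text passes through Unicode in the middle, so a custom encoding only ever has to translate between its own bytes and chars.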

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.


It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)


Well, if it's given a string rather than an array of bytes, it's
already in the right format.

I just don't know where to go from here.


Well, which part of deriving from Encoding are you having trouble with?


Oh, that part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)


Yes - it's likely to end up being *very* simple. You don't convert one
byte to another though - you convert a sequence of chars to a sequence
of bytes, or vice versa.
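
As a rough idea of how small such a class can be, here is a sketch of a table-driven single-byte Encoding. It is not the EBCDIC library's code: SingleByteTableEncoding is a hypothetical name, the caller supplies the 128 characters for bytes 0x80-0xFF (U+FFFD marks an undefined slot), the low half is assumed to be ASCII, and the two table entries in Main are assumed HP Roman-8 values to verify against the real chart. Argument validation is mostly left out.

```csharp
using System;
using System.Text;

public sealed class SingleByteTableEncoding : Encoding
{
    private const char Unmapped = '\uFFFD';
    private readonly char[] byteToChar = new char[256];
    private readonly byte[] charToByte = new byte[65536];
    private readonly bool[] hasByte = new bool[65536];

    public SingleByteTableEncoding(char[] highHalf)
    {
        if (highHalf == null || highHalf.Length != 128)
            throw new ArgumentException("Expected 128 entries for bytes 0x80-0xFF.", "highHalf");

        for (int b = 0; b < 128; b++) byteToChar[b] = (char)b; // assumed: low half is ASCII
        Array.Copy(highHalf, 0, byteToChar, 128, 128);

        // Build the reverse (char -> byte) table; the lowest byte wins on duplicates.
        for (int b = 0; b < 256; b++)
        {
            char c = byteToChar[b];
            if (c == Unmapped || hasByte[c]) continue;
            charToByte[c] = (byte)b;
            hasByte[c] = true;
        }
    }

    // Fixed width: one char is always one byte, in both directions.
    public override int GetByteCount(char[] chars, int index, int count) { return count; }
    public override int GetCharCount(byte[] bytes, int index, int count) { return count; }
    public override int GetMaxByteCount(int charCount) { return charCount; }
    public override int GetMaxCharCount(int byteCount) { return byteCount; }

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            bytes[byteIndex + i] = hasByte[c] ? charToByte[c] : (byte)'?';
        }
        return charCount;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            chars[charIndex + i] = byteToChar[bytes[byteIndex + i]];
        }
        return byteCount;
    }
}

public static class TableEncodingDemo
{
    // Builds an encoding with only two high-half entries filled in (assumed values).
    public static Encoding CreatePartial()
    {
        var high = new char[128];
        for (int i = 0; i < 128; i++) high[i] = '\uFFFD';
        high[0xE0 - 0x80] = '\u00C1'; // Á
        high[0xDF - 0x80] = '\u00D4'; // Ô
        return new SingleByteTableEncoding(high);
    }

    public static void Main()
    {
        Encoding enc = CreatePartial();
        byte[] wire = enc.GetBytes("L\u00C1C\u00D4");
        Console.WriteLine(BitConverter.ToString(wire));           // 4C-E0-43-DF
        Console.WriteLine(enc.GetString(wire) == "L\u00C1C\u00D4"); // True
    }
}
```

The instance is created with new and used directly, so nothing has to be registered with Encoding.GetEncoding.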

As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/


To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?


I think you're looking at the wrong thing. I'm not talking about the
code on that page directly - I'm talking about the EBCDIC library
linked from that page, which gives you an example of how to create your
own Encoding.

Time to submit a feature request to .NET 2.5: pluggable codepages so
this isn't necessary. :-)


Being pluggable wouldn't help you at all - you'd still have to write
your own encoding, and if you've done that you don't need to call
Encoding.GetEncoding at all.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 22 '05 #10
Christopher H. Laco <me********@gmail.com> wrote:
Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

You may not need a data file at all - just because I happen to use them
for EBCDIC doesn't mean they necessarily fit what you're doing :)

I was looking at a more literal (but less flexible) approach like the
other Encoding. stuff (UTF8/ASCII).


Not sure what you mean by "more literal" approach, but if you're happy,
that's fine...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too
Nov 22 '05 #11
