Encoding/Codepage: Can't Get There From Here

Christopher H. Laco

Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encodings, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-8 codepage system
wide. As good as it sounds, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-8.
Thus far, I haven't found anything that leads me to believe this is
possible in .NET, or that this specific codepage is supported, without
coding a custom Encoding class to do the conversion.

The HP Roman-8 codepage is available on the net, so it should be a
matter of mapping the two codepages, I would think.

Given the situation, what's the best way to tackle this problem?

-=Chris

Nov 22 '05 #1

Jon Skeet [C# MVP]

Christopher H. Laco <me********@gmail.com> wrote:

Long story longer. I need to get web user input into a backend system
that a) only groks single-byte encodings, b) expects the data transfer
to be 1 byte = 1 character, and c) uses the HP Roman-8 codepage system
wide. As good as it sounds, UTF/Unicode encoding is not an option,
nor is changing the codepage.

Tackling the first is easy via Encoding.Default.GetBytes and shoving it
over the network. However, Encoding.Default is the native 1252 ANSI
codepage.

What I need to do is convert the data from 1252/ISO Latin to HP Roman-8.

I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-8, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.

Thus far, I haven't found anything that leads me to believe this is
possible in .NET, or that this specific codepage is supported, without
coding a custom Encoding class to do the conversion.

The HP Roman-8 codepage is available on the net, so it should be a
matter of mapping the two codepages, I would think.

Given the situation, what's the best way to tackle this problem?

Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil
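
To make "writing an Encoding" concrete, here is a minimal sketch of a single-byte Encoding subclass. It is not the MiscUtil code and not a complete HP Roman-8 implementation: the class name Roman8SketchEncoding is made up, only three sample table entries are filled in, and those entries should be checked against the published HP Roman-8 chart before anyone relies on them. Argument validation is left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical single-byte encoding. The name and the sample table entries
// are illustrative assumptions, not code from the thread or from MiscUtil.
public sealed class Roman8SketchEncoding : Encoding
{
    private const byte Unknown = (byte)'?';

    // byte -> char for bytes 0x80..0xFF. Only a few sample entries are set;
    // verify every value against the published HP Roman-8 chart.
    private static readonly char[] HighChars = BuildHighChars();

    // char -> byte, derived from the table above.
    private static readonly Dictionary<char, byte> HighBytes = BuildHighBytes();

    private static char[] BuildHighChars()
    {
        var table = new char[128];
        for (int i = 0; i < table.Length; i++) table[i] = '?'; // unmapped slot
        table[0xE0 - 0x80] = '\u00C1'; // capital A with acute (assumed position)
        table[0xDF - 0x80] = '\u00D4'; // capital O with circumflex (assumed position)
        table[0xC5 - 0x80] = '\u00E9'; // small e with acute (assumed position)
        return table;
    }

    private static Dictionary<char, byte> BuildHighBytes()
    {
        var map = new Dictionary<char, byte>();
        for (int i = 0; i < HighChars.Length; i++)
            if (HighChars[i] != '?') map[HighChars[i]] = (byte)(i + 0x80);
        return map;
    }

    // Fixed-size character set: one char is always exactly one byte.
    public override int GetByteCount(char[] chars, int index, int count) => count;
    public override int GetCharCount(byte[] bytes, int index, int count) => count;
    public override int GetMaxByteCount(int charCount) => charCount;
    public override int GetMaxCharCount(int byteCount) => byteCount;

    public override int GetBytes(char[] chars, int charIndex, int charCount,
                                 byte[] bytes, int byteIndex)
    {
        for (int i = 0; i < charCount; i++)
        {
            char c = chars[charIndex + i];
            byte b;
            if (c < 0x80) b = (byte)c;                        // ASCII passes through
            else if (!HighBytes.TryGetValue(c, out b)) b = Unknown; // not in the table
            bytes[byteIndex + i] = b;
        }
        return charCount;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount,
                                 char[] chars, int charIndex)
    {
        for (int i = 0; i < byteCount; i++)
        {
            byte b = bytes[byteIndex + i];
            chars[charIndex + i] = b < 0x80 ? (char)b : HighChars[b - 0x80];
        }
        return byteCount;
    }
}
```

With this in place, new Roman8SketchEncoding().GetBytes(text) returns one byte per character, and any character missing from the table comes out as '?'.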

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 22 '05 #2

Christopher H. Laco

Jon Skeet [C# MVP] wrote:

What I need to do is convert the data from 1252/ISO Latin to HP Roman-8.

I'd suggest that a better way would be to keep the data in Unicode
until you need it in HP Roman-8, and only encode it then. Going via
Encoding.Default is only going to confuse things, IMO.

That's pretty much what happens. It's not really stored anywhere
session-wise. I'm just trying to convert it to something the backend can
handle right before I write it to the socket.

Encoding.Default was my first try. I need to do some more digging. I'm
not sure what codepage .NET thinks it is when I get it from
IIS/ASP->COM->Assembly.

Writing an Encoding isn't that hard, especially for fixed-size
character sets. You might be able to use a lot of the code I've got for
EBCDIC. See
http://www.pobox.com/~skeet/csharp/miscutil

Yeah, that's what I was looking at yesterday. :-)

-=Chris

Nov 22 '05 #3

Christopher H. Laco

To be honest, I still don't know where to start. I'm still a little
confused about how converting from one codepage to another actually happens.

Converting from Latin 1 to HP Roman8 is easy since I have the HP Roman8
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've rarely had to think about such
things.

I can just hack a quick Latin 1 to HP Roman8 byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage. It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding("mycustom").

-=Chris

Nov 22 '05 #4

Jon Skeet [C# MVP]

Christopher H. Laco <me********@gmail.com> wrote:

To be honest, I still don't know where to start. I'm still a little
confused about how converting from one codepage to another actually happens.

You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

Converting from Latin 1 to HP Roman8 is easy since I have the HP Roman8
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've rarely had to think about such
things.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

I can just hack a quick Latin 1 to HP Roman8 byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.

It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding("mycustom").

Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 22 '05 #5

Christopher H. Laco

Jon Skeet [C# MVP] wrote:

Christopher H. Laco <me********@gmail.com> wrote:
To be honest, I still don't know where to start. I'm still a little
confused about how converting from one codepage to another actually happens.

You shouldn't need to convert from one codepage to another - you should
only need to convert from a .NET string (which is Unicode) to your
target code page.

Converting from Latin 1 to HP Roman8 is easy since I have the HP Roman8
codepage listing the source/dest numbers.

How does conversion happen among all the various codepage variations?
There's some math or process there I'm failing to understand. It's
probably not hard; it's just that I've rarely had to think about such
things.

I think you're getting hung up about a "source codepage" for no reason.
Are you correctly getting the data as a .NET string? If so, don't worry
about the original source any more.

I can just hack a quick Latin 1 to HP Roman8 byte conversion together, but
that's not very stable. I'm looking for a "proper" solution that can
convert to HP Roman8 regardless of the source codepage.

There's no real source codepage when you're converting a .NET string -
it's just Unicode.

It would be nice
if I could just register a custom codepage in .NET and get it using
Encoding.GetEncoding("mycustom").

Unfortunately I don't believe you can do that. .NET isn't as pluggable
as it might be in a few places...

I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET, where I have it in a string. So, yes, the
source is irrelevant in this case. But for the sake of learning, I'd like
to understand how conversions between two codepages work in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:

string data = "LÁCÔ";

.NET stores it internally as Unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:

Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET Unicode string variable data into the
default ANSI 1252 codepage on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.
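
The difference between the two calls can be seen by dumping the bytes. This is only a sketch: the Hex helper and the class name are made up, and ISO-8859-1 stands in for Encoding.Default because the default codepage depends on the machine (for these two accented letters, Latin 1 and Windows-1252 produce the same bytes).

```csharp
using System;
using System.Text;

public static class BuiltInEncodingDemo
{
    // Hypothetical helper: encode the text and show the bytes as hex.
    public static string Hex(Encoding encoding, string text) =>
        BitConverter.ToString(encoding.GetBytes(text));

    public static void Main()
    {
        // "LACO" with A-acute and O-circumflex, written with escapes so the
        // source file's own encoding cannot interfere.
        string data = "L\u00C1C\u00D4";

        // ASCII has no code for the accented letters, so each becomes '?' (0x3F).
        Console.WriteLine(Hex(Encoding.ASCII, data));                      // 4C-3F-43-3F

        // Latin 1 keeps them, but at Latin 1 positions, not HP Roman8 ones.
        Console.WriteLine(Hex(Encoding.GetEncoding("iso-8859-1"), data)); // 4C-C1-43-D4
    }
}
```

Either way the backend receives one byte per character; the second call simply puts the accented letters at positions where HP Roman8 expects different characters.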

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode well enough to do
the conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back. No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

I just don't know where to go from here.

Thanks for the help!
-=Chris

Nov 22 '05 #6

Jon Skeet [C# MVP]

Christopher H. Laco <me********@gmail.com> wrote:

I hear what you're saying. The string comes from the browser, through
ASP, through COM, into .NET, where I have it in a string. So, yes, the
source is irrelevant in this case. But for the sake of learning, I'd like
to understand how conversions between two codepages work in general,
regardless of .NET.

Just to recap for my sanity. So I've got a string:

string data = "LÁCÔ";

.NET stores it internally as Unicode, but to get it over the wire to a
backend that doesn't understand multi-byte character semantics, I need
to do one of the following:

Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

The first is bad for obvious reasons; anything above 127 is turned into ?.

The second converts the .NET Unicode string variable data into the
default ANSI 1252 codepage on Windows. I can send this over the wire, but it
displays incorrectly on everything on the backend because it's using the
HP Roman8 codepage. Hence the need to convert to the Roman8 codepage
before sending the data.

Now, yes, I'm really needing to convert the string from a .NET unicode
string to HP Roman8. That's where I'm lost. I don't know how or where to
begin.

I know I need to subclass System.Text.Encoding, and that's it.

Fortunately, that's quite easy - and it's all you need to do.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode well enough to do
the conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

The web page in question appears to have been declared as ISO-8859-1
according to the headers and for now, I'll assume the browser is doing
the right thing and sending that encoding back.

It may not be sending ISO-8859-1 - there's no necessity for a browser
to make a request with the same encoding as the last page it looked at.

No special provisions
have been made in the page one way or the other. So .NET just guesses
correctly when converting it from ISO-8859-1 to Unicode for internal
variable storage.

No - it uses whatever the browser sends. It only has to guess if the
browser doesn't say what encoding to use.

I just don't know where to go from here.

Well, which part of deriving from Encoding are you having trouble with?
As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

It's got a few optimisations in there which you'll need to understand
in order to read the code, but you probably won't need to do
equivalents yourself.

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 22 '05 #7

Christopher H. Laco

Jon Skeet [C# MVP] wrote:

Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.

I just don't yet understand the connection between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Fortunately, that's quite easy - and it's all you need to do.

Thanks to a handy chart on the net comparing HP Roman8 to Latin 1, I
understand the numerical difference for bytes 127 to 255. I don't
understand the difference between HP Roman8 and Unicode well enough to do
the conversion. That's why I'm hung up on the source part; it's what I know
thus far.

Well, Unicode *is* Latin 1 for the first 256 values, so all you've got
to do is:

1) Convert characters which are in Latin 1 to HP Roman8 appropriately.
2) Do "something" (e.g. use the encoded version of '?') with characters
which aren't in HP Roman8.

Part of my misunderstanding is also what happened to the user input from
the browser into VB into .NET. Somewhere along the way, I wouldn't
expect it to all have been utf8/unicode.

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)

I just don't know where to go from here.

Well, which part of deriving from Encoding are you having trouble with?

Oh, the part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)

As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?

Time to submit a feature request for .NET 2.5: pluggable codepages so
this isn't necessary. :-)

Thanks,
-=Chris

Nov 22 '05 #8

Christopher H. Laco

Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

I was looking at a more literal (but less flexible) approach like the
other Encoding.* stuff (UTF8/ASCII).

-=Chris

Nov 22 '05 #9

Jon Skeet [C# MVP]

Christopher H. Laco <me********@gmail.com> wrote:

Jon Skeet [C# MVP] wrote:

Byte[] outputbuffer = Encoding.ASCII.GetBytes(data);
Byte[] outputbuffer = Encoding.Default.GetBytes(data);

No, you need to do that with the appropriate encoding, not ASCII or the
default encoding.

And that's the crux. There is no appropriate encoding built into .NET,
right? There's nothing in GetEncoding() that is going to help me here.

Indeed - so as I've been saying, you need to write your own encoding.
You don't need GetEncoding.

I just don't yet understand the connection between a subclass of
System.Text.Encoding and the actual conversion code used in
Encoding.Convert...

Encoding.Convert basically calls GetString or GetChars using the source
encoding to convert to Unicode, then GetBytes using the second encoding
to convert to bytes again.
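
As a rough illustration of that description (a sketch, not the framework's actual source), the two-step version can be written by hand and compared with Encoding.Convert. ConvertByHand and the class name are made-up names, and ISO-8859-1 to UTF-8 is used only because both encodings are built in.

```csharp
using System;
using System.Text;

public static class ConvertDemo
{
    // Conceptual equivalent of Encoding.Convert: decode the source bytes to
    // Unicode text, then encode that text with the target encoding.
    public static byte[] ConvertByHand(Encoding source, Encoding target, byte[] bytes)
    {
        string text = source.GetString(bytes); // source bytes -> Unicode
        return target.GetBytes(text);          // Unicode -> target bytes
    }

    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] input = { 0x4C, 0xC1, 0x43, 0xD4 }; // L, A-acute, C, O-circumflex in Latin 1

        byte[] viaConvert = Encoding.Convert(latin1, Encoding.UTF8, input);
        byte[] byHand = ConvertByHand(latin1, Encoding.UTF8, input);

        Console.WriteLine(BitConverter.ToString(viaConvert)); // 4C-C3-81-43-C3-94
        Console.WriteLine(BitConverter.ToString(byHand));     // 4C-C3-81-43-C3-94
    }
}
```

A custom Encoding subclass slots in as the target in exactly the same way, which is why no byte-to-byte table between two codepages is needed.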

Well, when the web browser sends a request, it includes (or at least
should include :) the encoding for whatever data it's sending - and
ASP.NET converts that into Unicode.

It's not that simple in this case. The .NET assembly doing the network
I/O is completely unaware of the browser, ASP or the form post. It's
just given a string from Request.Form("SomeData"). The fact that that
works without much hassle all the way up to this point is a modern
miracle. ;-)

Well, if it's given a string rather than an array of bytes, it's
already in the right format.

I just don't know where to go from here.

Well, which part of deriving from Encoding are you having trouble with?

Oh, the part where I have to take one byte, map it, and convert it to
another, and how that works by just subclassing Encoding and 'that's all
I have to do'. :-)

Yes - it's likely to end up being *very* simple. You don't convert one
byte to another though - you convert a sequence of chars to a sequence
of bytes, or vice versa.

As I said before, my EBCDIC encoding should give a good starting point,
although I think I gave the wrong URL - you want
http://www.pobox.com/~skeet/csharp/ebcdic/

To be honest, it's confusing to me. It's way more than I need to some
extent. It's converting EBCDIC to ASCII using an external codepage
file. Or was that the point of it all?

I think you're looking at the wrong thing. I'm not talking about the
code on that page directly - I'm talking about the EBCDIC library
linked from that page, which gives you an example of how to create your
own Encoding.

Time to submit a feature request for .NET 2.5: pluggable codepages so
this isn't necessary. :-)

Being pluggable wouldn't help you at all - you'd still have to write
your own encoding, and if you've done that you don't need to call
Encoding.GetEncoding at all.
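
In other words, the sending code only needs an Encoding instance, however it was obtained. A small sketch of that idea follows; Send and WireDemo are made-up names, a MemoryStream stands in for the socket stream, and ISO-8859-1 stands in for the custom encoding, which would simply be created with "new".

```csharp
using System;
using System.IO;
using System.Text;

public static class WireDemo
{
    // Encodes the text with whatever Encoding instance the caller supplies
    // and writes the resulting bytes to the stream.
    public static void Send(Stream stream, Encoding encoding, string text)
    {
        byte[] buffer = encoding.GetBytes(text);
        stream.Write(buffer, 0, buffer.Length);
    }

    public static void Main()
    {
        // Stand-in encoding; a custom Encoding subclass would be constructed
        // here directly, with no call to Encoding.GetEncoding.
        Encoding encoding = Encoding.GetEncoding("iso-8859-1");

        using (var stream = new MemoryStream())
        {
            Send(stream, encoding, "L\u00C1C\u00D4");
            Console.WriteLine(stream.Length); // 4 (one byte per character)
        }
    }
}
```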

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 22 '05 #10

Jon Skeet [C# MVP]

Christopher H. Laco <me********@gmail.com> wrote:

Christopher H. Laco wrote:

OK, the bell just went off. I didn't realize that it was two separate
parts, and I only need to create the dat file *once* and use it in the
encoder via GetEncoding.

You may not need a data file at all - just because I happen to use them
for EBCDIC doesn't mean they necessarily fit what you're doing :)

I was looking at a more literal (but less flexible) approach like the
other Encoding.* stuff (UTF8/ASCII).

Not sure what you mean by "more literal" approach, but if you're happy,
that's fine...

--
Jon Skeet - <sk***@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Nov 22 '05 #11
