UTF-8 to HTML conversion

Martin Trautmann

Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character
starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
JavaScript decoder - but maybe there's a recommended little helper that
could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
.... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
.... and maybe
EUR -> &euro;

Thanks,
Martin

Jul 23 '05 #1

Eric Kenneth Bustad

In article <sl******************@id-685.user.individual.de>,
Martin Trautmann <tr***@gmx.de> wrote:

Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character
starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
JavaScript decoder - but maybe there's a recommended little helper that
could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
... and maybe
EUR -> &euro;

What's the problem with just using the UTF-8 output as is?

--
= Eric Bustad, Norwegian bachelor programmer

Jul 23 '05 #2

phil_gg04

> is there any kind of 'hiconv' or other (unix-like) conversion tool
> that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

Martin,

HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8. Alternatively put a META tag in the header that
declares it as such. Long ago HTML was restricted to Latin1 but that
is history. (Maybe there is more to this than you are telling us?)

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne

(Whether that example looks right depends on what happens to this post
between me and you...)

--Phil.

Jul 23 '05 #3

Smike

1. down load DELO editor
http://www.russiantext.ircdb.org/ruseditE.htm

2. type any Cyrillic text (editor works at NT/2000/XP, no Cyrillic font
setup or Cyrillic driver is required)

3. Select all or part of Cyrillic text

4. go to Format->Code convertion->Symbols to HTML Decimal Menu item

5. click and Cyrillic "privet" text will be converted to
&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;

Bye
Smike
http://smike.ru
http://xedit.smike.ru
Martin Trautmann wrote:

Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as JavaScript decoder - but maybe there's a recommended little helper that could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
... and maybe
EUR -> &euro;

Thanks,
Martin

Jul 23 '05 #4

Martin Trautmann

On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:

What's the problem with just using the UTF-8 output as is?

it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte. But then there are
some remaining chars which should get some further translation.

Example (sorry if my charset is incorrect):

->

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

->

<h3><a name="01-5">Russian</a></h3><i>&25B >;H51AB20</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>

->

<h3><a name="01-5">Russian</a></h3><i>Цвет Волшебства</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>

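For reference, the garbled title in the example above can be reproduced with a minimal Python sketch. It assumes the export is big-endian UTF-16 (the thread does not say which byte order the database uses), and the helper name strip_high_bytes is made up for illustration. It keeps only the low-order byte of each 16-bit unit, which is what "stripping the other byte" amounts to:

```python
def strip_high_bytes(text: str) -> bytes:
    """Encode as UTF-16 (big-endian, an assumption) and drop every high-order byte."""
    utf16 = text.encode("utf-16-be")
    return utf16[1::2]  # bytes at odd offsets are the low-order halves


title = "Цвет Волшебства"

# ASCII survives, because its high-order byte is always zero.
print(strip_high_bytes("Russian"))  # b'Russian'

# Cyrillic (U+04xx) loses its high byte 0x04 and collapses into ASCII noise.
print(strip_high_bytes(title))      # b'&25B \x12>;H51AB20'

# A real conversion decodes the text first and then re-encodes it.
print(title.encode("utf-16-be").decode("utf-16-be").encode("utf-8"))
```

The second output matches the broken line in the example, which suggests the damage comes from the byte stripping itself and not from the database.
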
Jul 23 '05 #5

Martin Trautmann

On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:

>> is there any kind of 'hiconv' or other (unix-like) conversion tool that
>> would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
>
> Martin,
>
> HTML4 can be UTF-8; just serve it as content-type: text/html;
> charset=utf-8. Alternatively put a META tag in the header that
> declares it as such. Long ago HTML was restricted to Latin1 but that
> is history. (Maybe there is more to this than you are telling us?)

I guess so - I suppose I should have written about UTF-16 instead of
UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
readable - as source file.

> Anyway, if you really want to convert Unicode to latin1 + html
> character entities, I believe that GNU recode can do what you want:
>
> $ recode utf8..html
> Martel est considéré comme "père" de la spéléologie moderne
> Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
> sp&eacute;l&eacute;ologie moderne

ah, this would be an option. However, I'll have to exclude the real
<, > ", &

Thanks,
Martin

Jul 23 '05 #6

Martin Trautmann

On 1 Mar 2005 13:14:20 -0800, Smike wrote:

1. down load DELO editor
http://www.russiantext.ircdb.org/ruseditE.htm

2. type any Cyrillic text (editor works at NT/2000/XP, no Cyrillic font
setup or Cyrillic driver is required)

Thanks, but that's not an option:

is there any kind of 'hiconv' or other (unix-like) conversion tool that
                                        ^^^^^^^^^
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

I don't have any Win, but Solaris and Mac OSX.

Jul 23 '05 #7

Lachlan Hunt

Martin Trautmann wrote:

On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:
What's the problem with just using the UTF-8 output as is?

it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte.

Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.

However any program that performed such a conversion like that is
extremely broken. Read the Unicode spec and implement it properly, or
use some existing libraries that have already been written correctly for
the job, just don't implement it the way you suggested above.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

Jul 23 '05 #8

phil_gg04

>>> convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

>> HTML4 can be UTF-8;
>
> I guess so - I suppose I should have written about UTF-16 instead of
> UTF-8. UTF-8 could be just fine. However, the current UTF-16 is
> hardly readable - as source file

This is the best solution; just "recode UTF-16..UTF-8" and set up the
server to serve as UTF-8. If you're using Apache beware of how it sets
the default character set - the HTTP header's character set overrides
anything set in a META tag, and Apache now sets latin-1 as a default
(there are security issues behind this I believe). This caused me a
few hours of debugging once so I thought I'd mention it.
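
If recode is not at hand, the same UTF-16 to UTF-8 step can be sketched in Python. This is only an illustration, not part of recode. It assumes the input starts with a byte order mark; without one the "utf-16" codec has to assume a byte order, so naming "utf-16-be" or "utf-16-le" explicitly would be safer. The function name is illustrative.

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 input (byte order taken from the BOM) and re-encode as UTF-8."""
    return data.decode("utf-16").encode("utf-8")


# Build a small big-endian sample with a BOM, standing in for a database export.
sample = codecs.BOM_UTF16_BE + "<h1>¿Dondé?</h1>".encode("utf-16-be")
print(utf16_to_utf8(sample))  # b'<h1>\xc2\xbfDond\xc3\xa9?</h1>'
```

The result still has to be served with charset=utf-8 in the HTTP header, for the reason given above.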

$ recode utf8..html

ah, this would be an option. However, I'll have to exclude the real
<, > ", &

Then you need -d:

$ recode -d utf8..html
<h1>¿Dondé?</h1>
<h1>&iquest;Dond&eacute;?</h1>

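What the two modes do can be approximated in a few lines of Python. This is not recode's implementation, only a sketch of the behaviour described in this thread, built on the standard library's HTML 4 entity table. The function name to_html and the escape_markup flag are hypothetical. With the flag off it behaves like -d: the markup characters pass through untouched.

```python
from html.entities import codepoint2name

# Characters that plain "recode utf8..html" would also turn into entities.
MARKUP = {"<": "&lt;", ">": "&gt;", '"': "&quot;", "&": "&amp;"}


def to_html(text: str, escape_markup: bool = False) -> str:
    """Replace non-ASCII characters with named or numeric references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if escape_markup and ch in MARKUP:
            out.append(MARKUP[ch])
        elif cp < 128:
            out.append(ch)                         # ASCII is emitted directly
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")  # named entity where HTML 4 has one
        else:
            out.append(f"&#{cp};")                 # decimal reference otherwise
    return "".join(out)


print(to_html("<h1>¿Dondé?</h1>"))              # <h1>&iquest;Dond&eacute;?</h1>
print(to_html('"père"', escape_markup=True))    # &quot;p&egrave;re&quot;
print(to_html("привет"))                        # &#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
```
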
--Phil.

Jul 23 '05 #9

Lachlan Hunt

Martin Trautmann wrote:

On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:
HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8...

I guess so - I suppose I should have written about UTF-16 instead of
UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
readable - as source file.

Why not? You just need to get an editor that supports UTF-16, rather
than an editor that only supports ISO-8859-1 (though most usually
support Windows-1252 and call it ISO-8859-1, or equivalent, anyway)

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne

What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?

Jul 23 '05 #10

Martin Trautmann

On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:

> Stripping away one byte from a UTF-16 character does not necessarily
> produce the equivalent UTF-8 character. That would only apply to the
> US-ASCII subset of Unicode, for which the high-order octet is set to 0
> and the low-order octet matches the US-ASCII character, but does not
> apply to any other character.

True - but that's exactly the range where I wanted ASCII/Latin instead
of UTF-16.

> However any program that performed such a conversion like that is
> extremely broken. Read the Unicode spec and implement it properly, or
> use some existing libraries that have already been written correctly for
> the job, just don't implement it the way you suggested above.

That's why I was looking for a conversion tool. recode -d u6..h $file
is exactly what I was looking for.

Thanks,
Martin

Jul 23 '05 #11

Martin Trautmann

On 2 Mar 2005 04:30:56 -0800, ph*******@treefic.com wrote:

$ recode utf8..html

ah, this would be an option. However, I'll have to exclude the real
<, > ", &

Then you need -d:

$ recode -d utf8..html
<h1>¿Dondé?</h1>
<h1>&iquest;Dond&eacute;?</h1>

Thanks - that's perfect. I would not have recognized the -d option
"convert only diacritics or alike for HTML/LaTeX"
as the best option.

Martin

Jul 23 '05 #12

Martin Trautmann

On Wed, 02 Mar 2005 23:31:15 +1100, Lachlan Hunt wrote:

What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?

Could you send me a sample?

recode does not produce the named characters first or at all - a named
output from my html editor is e.g.

Greek Θανατηφόρ&omicron;&sigmaf; Βοηθό&sigmaf;

where recode does produce
Greek Θανατηφόρος Β&#959;ηθός

- Martin

Jul 23 '05 #13

phil_gg04

>> $ recode -d utf8..html

Thanks - that's perfect. I would not have recognized the -d option
"convert only diacritics or alike for HTML/LaTeX"
as the best option.

No, not from that description! Try this for the better documentation
(near the end):

$ info recode HTML

--Phil.

Jul 23 '05 #14

phil_gg04

>> $ recode utf8..html

What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?

Yes. Again, whether you see this depends on what happens to this
message between me and you. Apologies to any Thai readers, I've just
copied and pasted this at random.

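$ recode utf8..html
สื่อฮ่องกงรายงานว่านายตุง
ชี หว่า
&#3626;&#3639;&#3656;&#3629;&#3630;&#3656;&#3629;&#3591;&#3585;&#3591;&#3619;&#3634;&#3618;&#3591;&#3634;&#3609;&#3623;&#3656;&#3634;&#3609;&#3634;&#3618;&#3605;&#3640;&#3591;
&#3594;&#3637; &#3627;&#3623;&#3656;&#3634;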

Quote from the documentation:

Codes not having a mnemonic entity are output by `recode' using the
`&#NNN;' notation, where NNN is a decimal representation of the UCS
code value. When there is an entity name for a character, it is always
preferred over a numeric character reference. ASCII printable
characters are always generated directly. So is the newline.
See for example:

http://www.delorie.com/gnu/docs/recode/recode_49.html
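
For comparison, Python's standard library has a purely numeric variant of this. A minimal sketch, with an illustrative function name: the "xmlcharrefreplace" error handler emits a decimal reference for every character the target encoding cannot represent. Unlike the recode behaviour quoted above, it never uses entity names, so é becomes &#233; and not &eacute;.

```python
def numeric_refs(text: str) -> str:
    """Turn every non-ASCII character into a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(numeric_refs("\u0e0a\u0e35"))  # Thai "ชี" -> &#3594;&#3637;
print(numeric_refs("é"))             # &#233;
print(numeric_refs("plain ASCII"))   # unchanged
```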

Phil.

Jul 23 '05 #15

Martin Trautmann

On 2 Mar 2005 05:50:50 -0800, ph*******@treefic.com wrote:

When there is an entity name for a character, it is always
preferred over a numeric character reference.

Oops - you're right. It was my HTML editor which created numbers
instead of names, while recode used the name.

thanks,
Martin

Jul 23 '05 #16

Alan J. Flavell

On Wed, 2 Mar 2005, Lachlan Hunt wrote:

> Martin Trautmann wrote:
>> it's not really UTF-8, but UTF-16. I don't have a major problem to make
>> UTF-8 from UTF-16 - just stripping the other byte.

That would - at best - make iso-8859-1, rather than utf-8. But
horribly wrong if the stripped byte wasn't zero.

> Stripping away one byte from a UTF-16 character does not necessarily
> produce the equivalent UTF-8 character. That would only apply to
> the US-ASCII subset

Indeed. us-ascii is indistinguishable from iso-8859-1 or utf-8 under
those restricted circumstances.

If you want a single additional step that does this code conversion,
then I'd be looking at recode (as others have also said). But it
would be better to look at existing process and identify a point at
which characters can be transparently converted as part of the
process. XML processors can typically do this with appropriate
parameters, for example, if an XML process is already involved.

Database access interfaces may be able to recode the data too, and so
on.

> However any program that performed such a conversion like that is
> extremely broken. Read the Unicode spec and implement it properly,
> or use some existing libraries that have already been written
> correctly for the job, just don't implement it the way you suggested
> above.

Oh, quite.

Btw, for manual processes, Mozilla Composer seems to do quite a
reasonable job when you "save and change character encoding". Back in
Netscape (<=4) days we used to call it "Netscape Composter", but it's
much better in its Mozilla embodiment, AFAICS.

Jul 23 '05 #17

Martin Trautmann

On Wed, 2 Mar 2005 14:14:42 +0000, Alan J. Flavell wrote:

On Wed, 2 Mar 2005, Lachlan Hunt wrote:
Martin Trautmann wrote:
>> it's not really UTF-8, but UTF-16. I don't have a major problem to make
>> UTF-8 from UTF-16 - just stripping the other byte.
>
> That would - at best - make iso-8859-1, rather than utf-8. But
> horribly wrong if the stripped byte wasn't zero.

I was assuming a zero byte - whatever it is that is shown in vim as ^@.

> But it
> would be better to look at existing process and identify a point at
> which characters can be transparently converted as part of the
> process. XML processors can typically do this with appropriate
> parameters, for example, if an XML process is already involved.

We are talking about a 'stupid' data export to a text file which happens
to be UTF-16.

> Btw, for manual processes, Mozilla Composer seems to do quite a
> reasonable job when you "save and change character encoding". Back in
> Netscape (<=4) days we used to call it "Netscape Composter", but it's
> much better in its Mozilla embodiment, AFAICS.

Indeed - and that was my manual process first and was my reference for
comparison of the results.

I was surprised that it did not attempt to 'improve' anything that I
built myself (such as repairing "<a name=x>x</a>" -> "<a name=x></a>x")
It did fix e.g. </i><i> or <i></i> - and it did not create new
paragraphs on its own where the composter always exchanged 'random'
amounts of <p> and <br>.

Even better: it created validator approved HTML ;-)
However, I did not want to use Mozilla as my required UTF-16 converter.

- Martin

Jul 23 '05 #18

Shmuel (Seymour J.) Metz

In <11**********************@f14g2000cwb.googlegroups .com>, on
03/01/2005
at 11:07 AM, ph*******@treefic.com said:

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

What about bidi text?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #19

phil_gg04

>>Anyway, if you really want to convert Unicode to latin1 + html
>> character entities, I believe that GNU recode can do what you want:
>
> What about bidi text?

Good question. I believe that it generates one numeric character
entity for each unicode character without special consideration for
this. I think that this should just work, if the browser does the
right thing, shouldn't it?

--Phil.

Jul 23 '05 #20

Alan J. Flavell

On Wed, 2 Mar 2005, Shmuel (Seymour J.) Metz wrote:

In <11**********************@f14g2000cwb.googlegroups .com>, on
03/01/2005
at 11:07 AM, ph*******@treefic.com said:
Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

What about bidi text?

What about it? There's no reason that a mere change of character
encoding should have any effect on bidi properties. Arabic and logical
Hebrew should simply *work* (at least as well as they ever work in
your choice of browser).

See also http://www.nirdagan.com/hebrew/compare

So where do you foresee a problem?
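
The point can be checked with a short Python sketch, assuming the text is stored in logical order. Writing each character as a numeric reference and parsing the references back gives the same code points in the same order, so their bidi classes are unchanged. How a browser then lays the text out is not something this sketch can show.

```python
import html
import unicodedata

hebrew = "\u05e9\u05dc\u05d5\u05dd"  # "shalom", stored in logical order

# One decimal reference per character, in the original order.
refs = "".join(f"&#{ord(ch)};" for ch in hebrew)
print(refs)  # &#1513;&#1500;&#1493;&#1501;

# Parsing the references back yields the identical string.
decoded = html.unescape(refs)
print(decoded == hebrew)  # True

# Every character still has the right-to-left bidi class.
print([unicodedata.bidirectional(ch) for ch in decoded])  # ['R', 'R', 'R', 'R']
```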

Jul 23 '05 #21

Lachlan Hunt

Martin Trautmann wrote:

On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:
Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.

True - but that's exactly the range where I wanted ASCII/Latin instead
of UTF-16.

It won't work for the latin range, only the US-ASCII range I said.

US-ASCII: Code positions 0 to 127 (decimal)
In UTF-8, these are encoded as single octets which are identical to US-ASCII
Latin: Code positions 128 to 255
In UTF-8, these (and everything above) are encoded using multiple
octets. I think all of this range is encoded as 2 octets, but they *do
not* match the UTF-16 encoding.

eg. Encoding the copyright symbol: © (decimal 169)
UTF-8: 0xC2 0xA9
UTF-16: 0x00 0xA9
ISO-8859-1: 0xA9 (this is a single-octet encoding)

That's why stripping the other byte will simply not work to convert
UTF-16 to UTF-8. Stripping the first byte also won't work for any
character above 255 (outside the ISO-8859-1 range).
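
The three encodings listed above can be reproduced with a few lines of Python. This is a minimal sketch that assumes big-endian UTF-16 to match the byte order shown; the variable names are illustrative.

```python
symbol = "\u00a9"  # copyright sign, decimal 169

# Encode the same character three ways and show the resulting octets.
encoded = {name: symbol.encode(name) for name in ("utf-8", "utf-16-be", "iso-8859-1")}
for name, octets in encoded.items():
    print(f"{name:10} {octets.hex(' ')}")
# utf-8      c2 a9
# utf-16-be  00 a9
# iso-8859-1 a9

# Dropping the high-order octet of UTF-16 gives ISO-8859-1 here, not UTF-8.
low_octets = encoded["utf-16-be"][1::2]
print(low_octets == encoded["iso-8859-1"])  # True
print(low_octets == encoded["utf-8"])       # False
```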

Jul 23 '05 #22

Martin Trautmann

On Thu, 03 Mar 2005 08:50:11 +1100, Lachlan Hunt wrote:

It won't work for the latin range, only the US-ASCII range I said.

Thanks for your further explanation. I actually did not have any
critical problems within this range yet. But I felt it was the wrong
move, apart from being incomplete.

Martin

Jul 23 '05 #23
