UTF-8 to HTML conversion


Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character
starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
JavaScript decoder - but maybe there's a recommended little helper that
could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
.... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
.... and maybe
EUR -> &euro;

Thanks,
Martin
Jul 23 '05 #1
In article <sl******************@id-685.user.individual.de>,
Martin Trautmann <tr***@gmx.de> wrote:

Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character
starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
JavaScript decoder - but maybe there's a recommended little helper that
could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
... and maybe
EUR -> &euro;


What's the problem with just using the UTF-8 output as is?

--
= Eric Bustad, Norwegian bachelor programmer
Jul 23 '05 #2
> is there any kind of 'hiconv' or other (unix-like) conversion tool
that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?


Martin,

HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8. Alternatively put a META tag in the header that
declares it as such. Long ago HTML was restricted to Latin1 but that
is history. (Maybe there is more to this than you are telling us?)

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne

(Whether that example looks right depends on what happens to this post
between me and you...)
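
For readers without recode at hand, roughly the same conversion can be sketched in Python. This is only an approximation under stated assumptions, not a description of how recode works internally: the function name `to_html_entities` is invented for illustration, and the entity names come from the standard library table `html.entities.codepoint2name` (the HTML 4 entity set).

```python
from html.entities import codepoint2name


def to_html_entities(text: str) -> str:
    """Replace markup characters and non-ASCII characters with HTML references.

    Hypothetical helper: named entities are preferred, with a decimal
    numeric reference as the fallback.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp in codepoint2name and (cp > 127 or ch in '<>&"'):
            # A named HTML 4 entity exists, e.g. 233 -> "eacute".
            out.append(f"&{codepoint2name[cp]};")
        elif cp > 127:
            # No name available: fall back to a decimal reference.
            out.append(f"&#{cp};")
        else:
            # Plain ASCII passes through unchanged.
            out.append(ch)
    return "".join(out)


print(to_html_entities('Martel est considéré comme "père" de la spéléologie moderne'))
```

Note that this sketch also escapes the quotation marks, as in the recode output quoted above.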

--Phil.

Jul 23 '05 #3
1. Download the DELO editor
http://www.russiantext.ircdb.org/ruseditE.htm

2. type any Cyrillic text (editor works at NT/2000/XP, no Cyrillic font
setup or Cyrillic driver is required)

3. Select all or part of Cyrillic text

4. go to Format->Code convertion->Symbols to HTML Decimal Menu item

5. click and Cyrillic "privet" text will be converted to
&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;

Bye
Smike
http://smike.ru
http://xedit.smike.ru
Martin Trautmann wrote:
Hi all,

is there any kind of 'hiconv' or other (unix-like) conversion tool that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?

The database output is UTF-8 or UTF-16 only - Thus almost every character starts with ^@.

I've seen e.g.
http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as JavaScript decoder - but maybe there's a recommended little helper that could do:

- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML

such as Bulgarian and Russian charsets to &#10..;
... or even ISO-Latin-1 8bits to HTML
ä -> &auml;
... and maybe
EUR -> &euro;

Thanks,
Martin


Jul 23 '05 #4
On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:
What's the problem with just using the UTF-8 output as is?


it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte. But then there are
some remaining chars which should get some further translation.

Example (sorry if my charset is incorrect):

->

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">


->

<h3><a name="01-5">Russian</a></h3><i>&25B >;H51AB20</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>

->

<h3><a name="01-5">Russian</a></h3><i>Цвет Волшебства</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>
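
The garbled title in the middle step can be reproduced with a few lines of Python. This is a sketch that assumes the export is big-endian UTF-16 and that "stripping the other byte" means keeping only the low-order byte of each 16-bit unit; the helper name `strip_high_bytes` is made up for illustration.

```python
def strip_high_bytes(text: str) -> bytes:
    """Encode as UTF-16 (big-endian, no BOM) and keep only the low-order bytes."""
    utf16 = text.encode("utf-16-be")
    return utf16[1::2]  # every second byte, starting with the first low byte


title = "Цвет Волшебства"
print(strip_high_bytes(title))                     # Cyrillic collapses into ASCII junk
print(strip_high_bytes("Russian"))                 # pure ASCII survives intact
```

The Cyrillic letters all have a high byte of 0x04, so dropping it maps them onto unrelated ASCII characters, which is why the ASCII markup survives while the title does not.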
Jul 23 '05 #5
On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:
is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?


Martin,

HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8. Alternatively put a META tag in the header that
declares it as such. Long ago HTML was restricted to Latin1 but that
is history. (Maybe there is more to this than you are telling us?)


I guess so - I suppose I should have written about UTF-16 instead of
UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
readable - as source file.

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne


ah, this would be an option. However, I'll have to exclude the real
&lt;, &gt; &quot;, &amp;

Thanks,
Martin
Jul 23 '05 #6
On 1 Mar 2005 13:14:20 -0800, Smike wrote:
1. down load DELO editor
http://www.russiantext.ircdb.org/ruseditE.htm

2. type any Cyrillic text (editor works at NT/2000/XP, no Cyrillic font
setup or Cyrillic driver is required)


Thanks, but that's not an option:
is there any kind of 'hiconv' or other (unix-like) conversion tool that
                                        ^^^^^^^^^
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?


I don't have any Win, but Solaris and Mac OSX.
Jul 23 '05 #7
Martin Trautmann wrote:
On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:
What's the problem with just using the UTF-8 output as is?

it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte.


Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.

However, any program that performs a conversion like that is
extremely broken. Read the Unicode spec and implement it properly, or
use some existing libraries that have already been written correctly for
the job, just don't implement it the way you suggested above.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 23 '05 #8
>>> convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
>> HTML4 can be UTF-8;
> I guess so - I suppose I should have written about UTF-16 instead of
> UTF-8. UTF-8 could be just fine. However, the current UTF-16 is
> hardly readable - as source file


This is the best solution; just "recode UTF-16..UTF-8" and set up the
server to serve as UTF-8. If you're using Apache beware of how it sets
the default character set - the HTTP header's character set overrides
anything set in a META tag, and Apache now sets latin-1 as a default
(there are security issues behind this I believe). This caused me a
few hours of debugging once so I thought I'd mention it.
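
Where recode is not installed, the UTF-16 to UTF-8 step can also be sketched in Python. The function name `utf16_to_utf8` is an assumption made for this example, and the sketch assumes the export starts with a byte order mark; without one, Python's "utf-16" codec falls back to the machine's native byte order.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Decode UTF-16 input (byte order taken from the BOM) and re-encode as UTF-8."""
    text = data.decode("utf-16")  # consumes the BOM if present
    return text.encode("utf-8")


# Example: a BOM followed by the copyright sign in big-endian UTF-16.
sample = b"\xfe\xff\x00\xa9"
print(utf16_to_utf8(sample))
```

The resulting file still has to be served with a matching charset, as described above.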
>> $ recode utf8..html
> ah, this would be an option. However, I'll have to exclude the real
> &lt;, &gt; &quot;, &amp;


Then you need -d:

$ recode -d utf8..html
<h1>¿Dondé?</h1>
<h1>&iquest;Dond&eacute;?</h1>
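
The effect of -d can be imitated in Python by leaving every ASCII character alone, so existing markup passes through untouched. This is a rough sketch only; `diacritics_to_entities` is a hypothetical name, and the entity table is the standard library's `html.entities.codepoint2name`, which may not match recode's own table exactly.

```python
from html.entities import codepoint2name


def diacritics_to_entities(text: str) -> str:
    """Convert only non-ASCII characters; '<', '>', '&' and '"' are left as they are."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                          # ASCII, including markup
        elif cp in codepoint2name:
            out.append(f"&{codepoint2name[cp]};")   # named entity
        else:
            out.append(f"&#{cp};")                  # numeric fallback
    return "".join(out)


print(diacritics_to_entities("<h1>¿Dondé?</h1>"))
```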

--Phil.

Jul 23 '05 #9
Martin Trautmann wrote:
On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:
HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8...


I guess so - I suppose I should have written about UTF-16 instead of
UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
readable - as source file.


Why not? You just need to get an editor that supports UTF-16, rather
than an editor that only supports ISO-8859-1 (though most usually
support Windows-1252 and call it ISO-8859-1, or equivalent, anyway).

Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:

$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne


What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 23 '05 #10
On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:
Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.

True - but that's exactly the range where I wanted ASCII/Latin instead
of UTF-16.

However any program that performed such a conversion like that is
extremely broken. Read the Unicode spec and implement it properly, or
use some existing libraries that have already been written correctly for
the job, just don't implement it the way you suggested above.


That's why I was looking for a conversion tool. recode -d u6..h $file
is exactly what I was looking for.

Thanks,
Martin
Jul 23 '05 #11
On 2 Mar 2005 04:30:56 -0800, ph*******@treefic.com wrote:
$ recode utf8..html

ah, this would be an option. However, I'll have to exclude the real
&lt;, &gt; &quot;, &amp;


Then you need -d:

$ recode -d utf8..html
<h1>¿Dondé?</h1>
<h1>&iquest;Dond&eacute;?</h1>


Thanks - that's perfect. I would not have recognized the -d option
"convert only diacritics or alike for HTML/LaTeX"
as the best option.

Martin
Jul 23 '05 #12
On Wed, 02 Mar 2005 23:31:15 +1100, Lachlan Hunt wrote:
What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?


Could you send me a sample?

recode does not produce the named characters first or at all - a named
output from my html editor is e.g.

Greek &Theta;&alpha;&nu;&alpha;&tau;&eta;&phi;&#972;&rho;&omicron;&sigmaf; &Beta;&omicron;&eta;&theta;&#972;&sigmaf;

where recode does produce
Greek &#920;&#945;&#957;&#945;&#964;&#951;&#966;&#972;&#961;&#959;&#962; &#914;&#959;&#951;&#952;&#972;&#962;

- Martin
Jul 23 '05 #13
>> $ recode -d utf8..html
> Thanks - that's perfect. I would not have recognized the -d option
> "convert only diacritics or alike for HTML/LaTeX"
> as the best option.


No, not from that description! Try this for the better documentation
(near the end):

$ info recode HTML

--Phil.

Jul 23 '05 #14
>> $ recode utf8..html
> What about for Unicode characters that aren't included within the
> subsets with named character entity references in HTML4? Does it
> produce numeric character references instead?


Yes. Again, whether you see this depends on what happens to this
message between me and you. Apologies to any Thai readers, I've just
copied and pasted this at random.

$ recode utf8..html
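สื่อฮ่องกงรายงานว่านายตุง
ชี หว่า
&#3626;&#3639;&#3656;&#3629;&#3630;&#3656;&#3629;&#3591;&#3585;&#3591;&#3619;&#3634;&#3618;&#3591;&#3634;&#3609;&#3623;&#3656;&#3634;&#3609;&#3634;&#3618;&#3605;&#3640;&#3591;
&#3594;&#3637; &#3627;&#3623;&#3656;&#3634;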

Quote from the documentation:

Codes not having a mnemonic entity are output by `recode' using the
`&#NNN;' notation, where NNN is a decimal representation of the UCS
code value. When there is an entity name for a character, it is always
preferred over a numeric character reference. ASCII printable
characters are always generated directly. So is the newline.
See for example:

http://www.delorie.com/gnu/docs/recode/recode_49.html
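
The numeric fallback on its own is easy to reproduce in Python with the built-in "xmlcharrefreplace" error handler. Unlike recode, this sketch never uses entity names; it emits decimal references for everything outside ASCII. The function name `to_numeric_refs` is made up for the example.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#NNN; reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_refs("ชี"))        # two Thai code points, two references
print(to_numeric_refs("plain"))     # ASCII is generated directly
```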

Phil.

Jul 23 '05 #15
On 2 Mar 2005 05:50:50 -0800, ph*******@treefic.com wrote:
When there is an entity name for a character, it is always
preferred over a numeric character reference.


Oops - you're right. It was my HTML editor which created numbers
instead of names, while recode used the name.

thanks,
Martin
Jul 23 '05 #16
On Wed, 2 Mar 2005, Lachlan Hunt wrote:
Martin Trautmann wrote:
it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte.

That would - at best - make iso-8859-1, rather than utf-8. But
horribly wrong if the stripped byte wasn't zero.

Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to
the US-ASCII subset

Indeed. us-ascii is indistinguishable from iso-8859-1 or utf-8 under
those restricted circumstances.

If you want a single additional step that does this code conversion,
then I'd be looking at recode (as others have also said). But it
would be better to look at existing process and identify a point at
which characters can be transparently converted as part of the
process. XML processors can typically do this with appropriate
parameters, for example, if an XML process is already involved.

Database access interfaces may be able to recode the data too, and so
on.

However any program that performed such a conversion like that is
extremely broken. Read the Unicode spec and implement it properly,
or use some existing libraries that have already been written
correctly for the job, just don't implement it the way you suggested
above.


Oh, quite.

Btw, for manual processes, Mozilla Composer seems to do quite a
reasonable job when you "save and change character encoding". Back in
Netscape (<=4) days we used to call it "Netscape Composter", but it's
much better in its Mozilla embodiment, AFAICS.
Jul 23 '05 #17
On Wed, 2 Mar 2005 14:14:42 +0000, Alan J. Flavell wrote:
On Wed, 2 Mar 2005, Lachlan Hunt wrote:
Martin Trautmann wrote:
it's not really UTF-8, but UTF-16. I don't have a major problem to make
UTF-8 from UTF-16 - just stripping the other byte.

That would - at best - make iso-8859-1, rather than utf-8. But
horribly wrong if the stripped byte wasn't zero.


I was assuming a zero byte - whatever it is that is shown in vim as ^@.

But it
would be better to look at existing process and identify a point at
which characters can be transparently converted as part of the
process. XML processors can typically do this with appropriate
parameters, for example, if an XML process is already involved.

We are talking about a 'stupid' data export to a text file which happens
to be UTF-16.

Btw, for manual processes, Mozilla Composer seems to do quite a
reasonable job when you "save and change character encoding". Back in
Netscape (<=4) days we used to call it "Netscape Composter", but it's
much better in its Mozilla embodiment, AFAICS.


Indeed - and that was my manual process first and was my reference for
comparison of the results.

I was surprised that it did not attempt to 'improve' anything that I
built myself (such as repairing "<a name=x>x</a>" -> "<a name=x></a>x")
It did fix e.g. </i><i> or <i></i> - and it did not create new
paragraphs on its own where the composter always exchanged 'random'
amounts of <p> and <br>.

Even better: it created validator approved HTML ;-)
However, I did not want to use Mozilla as my required UTF-16 convertor.

- Martin
Jul 23 '05 #18
In <11**********************@f14g2000cwb.googlegroups.com>, on
03/01/2005
at 11:07 AM, ph*******@treefic.com said:
Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:


What about bidi text?

--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>

Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org

Jul 23 '05 #19
>> Anyway, if you really want to convert Unicode to latin1 + html
>> character entities, I believe that GNU recode can do what you want:
>
> What about bidi text?


Good question. I believe that it generates one numeric character
entity for each unicode character without special consideration for
this. I think that this should just work, if the browser does the
right thing, shouldn't it?

--Phil.

Jul 23 '05 #20
On Wed, 2 Mar 2005, Shmuel (Seymour J.) Metz wrote:
In <11**********************@f14g2000cwb.googlegroups.com>, on
03/01/2005
at 11:07 AM, ph*******@treefic.com said:
Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:


What about bidi text?


What about it? There's no reason that a mere change of character
encoding should have any effect on bidi properties. Arabic and logical
Hebrew should simply *work* (at least as well as they ever work in
your choice of browser).

See also http://www.nirdagan.com/hebrew/compare

So where do you foresee a problem?
Jul 23 '05 #21
Martin Trautmann wrote:
On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:
Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.

True - but that's exactly the range where I wanted ASCII/Latin instead
of UTF-16.


It won't work for the Latin range, only for the US-ASCII range, as I said.

US-ASCII: Code positions 0 to 127 (decimal)
In UTF-8, these are encoded as single octets which are identical to US-ASCII
Latin: Code positions 128 to 255
In UTF-8, these (and everything above) are encoded using multiple
octets. I think all of this range is encoded as 2 octets, but they *do
not* match the UTF-16 encoding.

eg. Encoding the copyright symbol: © (decimal 169)
UTF-8: 0xC2 0xA9
UTF-16: 0x00 0xA9
ISO-8859-1: 0xA9 (this is a single-octet encoding)

That's why stripping the other byte will simply not work to convert
UTF-16 to UTF-8. Stripping the first byte also won't work for any
character above 255 (outside the ISO-8859-1 range).
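
The byte values above can be checked with Python's built-in codecs. This is a small sketch, assuming big-endian UTF-16 without a byte order mark; the helper `is_valid_utf8` is a made-up name for illustration.

```python
copyright_sign = "\u00a9"  # the copyright symbol, decimal 169

utf8 = copyright_sign.encode("utf-8")         # two octets
utf16 = copyright_sign.encode("utf-16-be")    # high octet 0x00, low octet 0xA9
latin1 = copyright_sign.encode("iso-8859-1")  # one octet

# "Stripping the other byte" from the UTF-16 form:
stripped = utf16[1:]


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8.hex(), utf16.hex(), latin1.hex())
print(stripped == latin1, is_valid_utf8(stripped))
```

The stripped result equals the ISO-8859-1 encoding, but a lone 0xA9 octet is not valid UTF-8.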

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox
Jul 23 '05 #22
On Thu, 03 Mar 2005 08:50:11 +1100, Lachlan Hunt wrote:
It won't work for the Latin range, only for the US-ASCII range, as I said.


Thanks for your further explanation. I actually did not have any
critical problems within this range yet. But I felt it was the wrong
move, apart from being incomplete.

Martin
Jul 23 '05 #23
