Hi all,
is there any kind of 'hiconv' or other (unix-like) conversion tool that
would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
The database output is UTF-8 or UTF-16 only - thus almost every character
starts with ^@.
I've seen e.g. http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
JavaScript decoder - but maybe there's a recommended little helper that
could do:
- get rid of UTF-8 declarations where Latin is good enough
- convert others to most widely used HTML,
such as Bulgarian and Russian charsets to &#...;
... or even ISO-Latin-1 8bits to HTML:
ä -> &auml;
... and maybe
EUR -> &euro;
Thanks,
Martin
In article <sl******************@id-685.user.individual.de>,
Martin Trautmann <tr***@gmx.de> wrote:
> Hi all,
> is there any kind of 'hiconv' or other (unix-like) conversion tool that
> would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
> The database output is UTF-8 or UTF-16 only - thus almost every character
> starts with ^@.
> I've seen e.g. http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5 as
> JavaScript decoder - but maybe there's a recommended little helper that
> could do:
> - get rid of UTF-8 declarations where Latin is good enough
> - convert others to most widely used HTML,
> such as Bulgarian and Russian charsets to &#...;
> ... or even ISO-Latin-1 8bits to HTML: ä -> &auml; ... and maybe EUR -> &euro;
What's the problem with just using the UTF-8 output as is?
--
= Eric Bustad, Norwegian bachelor programmer
> is there any kind of 'hiconv' or other (unix-like) conversion tool
> that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
Martin,
HTML4 can be UTF-8; just serve it as content-type: text/html;
charset=utf-8. Alternatively put a META tag in the header that
declares it as such. Long ago HTML was restricted to Latin1 but that
is history. (Maybe there is more to this than you are telling us?)
Anyway, if you really want to convert Unicode to latin1 + html
character entities, I believe that GNU recode can do what you want:
$ recode utf8..html
Martel est considéré comme "père" de la spéléologie moderne
Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la
sp&eacute;l&eacute;ologie moderne
(Whether that example looks right depends on what happens to this post
between me and you...)
--Phil.
1. Download the DELO editor: http://www.russiantext.ircdb.org/ruseditE.htm
2. Type any Cyrillic text (the editor works on NT/2000/XP; no Cyrillic font
setup or Cyrillic driver is required)
3. Select all or part of the Cyrillic text
4. Go to the Format -> Code conversion -> Symbols to HTML Decimal menu item
5. Click, and the Cyrillic "privet" text will be converted to
&#1087;&#1088;&#1080;&#1074;&#1077;&#1090;
Bye
Smike http://smike.ru http://xedit.smike.ru
Martin Trautmann wrote:
> Hi all,
> is there any kind of 'hiconv' or other (unix-like) conversion tool
> that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
> The database output is UTF-8 or UTF-16 only - thus almost every
> character starts with ^@.
> I've seen e.g. http://aktuell.de.selfhtml.org/artik...64/utf8.htm#a5
> as JavaScript decoder - but maybe there's a recommended little helper
> that could do:
> - get rid of UTF-8 declarations where Latin is good enough
> - convert others to most widely used HTML,
> such as Bulgarian and Russian charsets to &#...;
> ... or even ISO-Latin-1 8bits to HTML: ä -> &auml; ... and maybe EUR -> &euro;
> Thanks, Martin
On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:
> What's the problem with just using the UTF-8 output as is?
It's not really UTF-8, but UTF-16. I don't have a major problem making
UTF-8 from UTF-16 - just stripping the other byte. But then there are
some remaining chars which should get some further translation.
Example (sorry if my charset is incorrect):
->
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
->
<h3><a name="01-5">Russian</a></h3><i>&25B >;H51AB20</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>
->
<h3><a name="01-5">Russian</a></h3><i>Цвет Волшебства</i>. EKSMO-Press, 2001-07 (5-04-007995-8)<br>
On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:
>> is there any kind of 'hiconv' or other (unix-like) conversion tool
>> that would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
> Martin,
> HTML4 can be UTF-8; just serve it as content-type: text/html;
> charset=utf-8. Alternatively put a META tag in the header that
> declares it as such. Long ago HTML was restricted to Latin1 but that
> is history. (Maybe there is more to this than you are telling us?)
I guess so - I suppose I should have written about UTF-16 instead of
UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
readable as a source file.
> Anyway, if you really want to convert Unicode to latin1 + html
> character entities, I believe that GNU recode can do what you want:
> $ recode utf8..html
> Martel est considéré comme "père" de la spéléologie moderne
> Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la sp&eacute;l&eacute;ologie moderne
Ah, this would be an option. However, I'll have to exclude the real
markup characters: <, >, ", &
Thanks,
Martin
On 1 Mar 2005 13:14:20 -0800, Smike wrote:
> 1. down load DELO editor http://www.russiantext.ircdb.org/ruseditE.htm
> 2. type any Cyrillic text (editor works at NT/2000/XP, no Cyrillic font setup or Cyrillic driver is required)
Thanks, but that's not an option:
>> is there any kind of 'hiconv' or other (unix-like) conversion tool that
                                           ^^^^^^^^^
>> would convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
I don't have any Win, but Solaris and Mac OSX.
Martin Trautmann wrote:
> On 1 Mar 2005 17:35:20 GMT, Eric Kenneth Bustad wrote:
>> What's the problem with just using the UTF-8 output as is?
> it's not really UTF-8, but UTF-16. I don't have a major problem to make UTF-8 from UTF-16 - just stripping the other byte.
Stripping away one byte from a UTF-16 character does not necessarily
produce the equivalent UTF-8 character. That would only apply to the
US-ASCII subset of Unicode, for which the high-order octet is set to 0
and the low-order octet matches the US-ASCII character, but does not
apply to any other character.
However, any program that performs a conversion like that is
extremely broken. Read the Unicode spec and implement it properly, or
use existing libraries that have already been written correctly for
the job; just don't implement it the way you suggested above.
--
Lachlan Hunt http://lachy.id.au/ http://GetFirefox.com/ Rediscover the Web http://GetThunderbird.com/ Reclaim your Inbox
>>> convert UTF-8 to HTML (ISO-Latin-1 and Unicode)?
>> HTML4 can be UTF-8;
> I guess so - I suppose I should have written about UTF-16 instead of
> UTF-8. UTF-8 could be just fine. However, the current UTF-16 is
> hardly readable as a source file
This is the best solution; just "recode UTF-16..UTF-8" and set up the
server to serve as UTF-8. If you're using Apache, beware of how it sets
the default character set - the HTTP header's character set overrides
anything set in a META tag, and Apache now sets latin-1 as a default
(there are security issues behind this, I believe). This caused me a
few hours of debugging once, so I thought I'd mention it.

>> $ recode utf8..html
> ah, this would be an option. However, I'll have to exclude the real <, > ", &
Then you need -d:
$ recode -d utf8..html
<h1>¿Dondé?</h1>
<h1>&iquest;Dond&eacute;?</h1>
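The effect of -d can be sketched in Python: leave all ASCII (including the markup characters) alone and convert only non-ASCII, preferring the HTML4 named entities. This is an approximation of recode's behaviour using Python's standard entity table, not recode's own code.

```python
from html.entities import codepoint2name

def diacritics_only(text: str) -> str:
    # Like recode -d: leave all ASCII (including < > & ") untouched,
    # convert everything else to a named or numeric entity.
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            parts.append(ch)                          # ASCII passes through
        elif cp in codepoint2name:
            parts.append(f"&{codepoint2name[cp]};")   # named, e.g. &eacute;
        else:
            parts.append(f"&#{cp};")                  # numeric fallback
    return "".join(parts)

print(diacritics_only("<h1>¿Dondé?</h1>"))
# <h1>&iquest;Dond&eacute;?</h1>
```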
--Phil.
Martin Trautmann wrote:
> On 1 Mar 2005 11:07:24 -0800, ph*******@treefic.com wrote:
>> HTML4 can be UTF-8; just serve it as content-type: text/html; charset=utf-8...
> I guess so - I suppose I should have written about UTF-16 instead of
> UTF-8. UTF-8 could be just fine. However, the current UTF-16 is hardly
> readable as a source file.
Why not? You just need to get an editor that supports UTF-16, rather
than an editor that only supports ISO-8859-1 (though most usually
support Windows-1252 and call it ISO-8859-1, or equivalent, anyway).

> Anyway, if you really want to convert Unicode to latin1 + html
> character entities, I believe that GNU recode can do what you want:
> $ recode utf8..html
> Martel est considéré comme "père" de la spéléologie moderne
> Martel est consid&eacute;r&eacute; comme &quot;p&egrave;re&quot; de la sp&eacute;l&eacute;ologie moderne
What about for Unicode characters that aren't included within the
subsets with named character entity references in HTML4? Does it
produce numeric character references instead?
--
Lachlan Hunt http://lachy.id.au/ http://GetFirefox.com/ Rediscover the Web http://GetThunderbird.com/ Reclaim your Inbox
On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:
> Stripping away one byte from a UTF-16 character does not necessarily produce the equivalent UTF-8 character. That would only apply to the US-ASCII subset of Unicode, for which the high-order octet is set to 0 and the low-order octet matches the US-ASCII character, but does not apply to any other character.
True - but that's exactly the range where I wanted ASCII/Latin instead
of UTF-16.
> However any program that performed such a conversion like that is extremely broken. Read the Unicode spec and implement it properly, or use some existing libraries that have already been written correctly for the job, just don't implement it the way you suggested above.
That's why I was looking for a conversion tool. recode -d u6..h $file
is exactly what I was looking for.
Thanks,
Martin
On 2 Mar 2005 04:30:56 -0800, ph*******@treefic.com wrote:
>>> $ recode utf8..html
>> ah, this would be an option. However, I'll have to exclude the real <, > ", &
> Then you need -d:
> $ recode -d utf8..html
> <h1>¿Dondé?</h1>
> <h1>&iquest;Dond&eacute;?</h1>
Thanks - that's perfect. I would not have recognized the -d option
"convert only diacritics or alike for HTML/LaTeX"
as the best option.
Martin
On Wed, 02 Mar 2005 23:31:15 +1100, Lachlan Hunt wrote:
> What about for Unicode characters that aren't included within the subsets with named character entity references in HTML4? Does it produce numeric character references instead?
Could you send me a sample?
recode does not produce the named characters first or at all - a named
output from my html editor is e.g.
Greek Θανατηφόρ&omicron;ς Βοηθός
where recode does produce
Greek Θανατηφόρος Β&#959;ηθός
- Martin
>> $ recode -d utf8..html
> Thanks - that's perfect. I would not have recognized the -d option
> "convert only diacritics or alike for HTML/LaTeX" as the best option.
No, not from that description! Try this for the better documentation
(near the end):
$ info recode HTML
--Phil.
>> $ recode utf8..html
> What about for Unicode characters that aren't included within the subsets with named character entity references in HTML4? Does it produce numeric character references instead?
Yes. Again, whether you see this depends on what happens to this
message between me and you. Apologies to any Thai readers; I've just
copied and pasted this at random.
$ recode utf8..html
สื่อฮ่องกงรายงานว่านายตุง ชี หว่า
&#3626;&#3639;&#3656;&#3629;&#3630;&#3656;&#3629;&#3591;&#3585;&#3591;&#3619;&#3634;&#3618;&#3591;&#3634;&#3609;&#3623;&#3656;&#3634;&#3609;&#3634;&#3618;&#3605;&#3640;&#3591;
&#3594;&#3637; &#3627;&#3623;&#3656;&#3634;
Quote from the documentation:
Codes not having a mnemonic entity are output by `recode' using the
`&#NNN;' notation, where NNN is a decimal representation of the UCS
code value. When there is an entity name for a character, it is always
preferred over a numeric character reference. ASCII printable
characters are always generated directly. So is the newline.
See for example: http://www.delorie.com/gnu/docs/recode/recode_49.html
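The "named when possible, numeric otherwise" rule can be checked against Python's table of HTML4 entity names, html.entities.codepoint2name. Thai codepoints have no entry there, which is why the output above falls back to &#NNN;:

```python
from html.entities import codepoint2name

# U+00E9 (é) has an HTML4 entity name, so a converter can emit &eacute;
print(codepoint2name.get(0x00E9))   # eacute

# U+0E2A (Thai ส) has no HTML4 name, so only &#3626; is possible
print(codepoint2name.get(0x0E2A))   # None
```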
Phil.
On 2 Mar 2005 05:50:50 -0800, ph*******@treefic.com wrote:
> When there is an entity name for a character, it is always preferred over a numeric character reference.
Oops - you're right. It was my HTML editor which created numbers
instead of names, while recode used the name.
thanks,
Martin
On Wed, 2 Mar 2005, Lachlan Hunt wrote:
> Martin Trautmann wrote:
>> it's not really UTF-8, but UTF-16. I don't have a major problem to make UTF-8 from UTF-16 - just stripping the other byte.
That would - at best - make iso-8859-1, rather than utf-8. But
horribly wrong if the stripped byte wasn't zero.
> Stripping away one byte from a UTF-16 character does not necessarily produce the equivalent UTF-8 character. That would only apply to the US-ASCII subset
Indeed. us-ascii is indistinguishable from iso-8859-1 or utf-8 under
those restricted circumstances.
If you want a single additional step that does this code conversion,
then I'd be looking at recode (as others have also said). But it
would be better to look at existing process and identify a point at
which characters can be transparently converted as part of the
process. XML processors can typically do this with appropriate
parameters, for example, if an XML process is already involved.
Database access interfaces may be able to recode the data too, and so
on.
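As one concrete instance of such a conversion step, iconv (shipped with Solaris and Mac OS X, the platforms mentioned earlier in the thread) handles plain UTF-16 to UTF-8 recoding. A sketch with placeholder file names and generated sample data:

```shell
# Make a small UTF-16 sample file standing in for the database export
# (\302\251 is the UTF-8 encoding of the copyright sign, U+00A9)
printf 'Ab\302\251\n' | iconv -f UTF-8 -t UTF-16 > export.txt

# The actual conversion step: UTF-16 (BOM-aware) to UTF-8
iconv -f UTF-16 -t UTF-8 export.txt > export.utf8.txt
cat export.utf8.txt
```

Use UTF-16LE or UTF-16BE instead of UTF-16 to force a byte order when the export has no BOM.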
> However any program that performed such a conversion like that is extremely broken. Read the Unicode spec and implement it properly, or use some existing libraries that have already been written correctly for the job, just don't implement it the way you suggested above.
Oh, quite.
Btw, for manual processes, Mozilla Composer seems to do quite a
reasonable job when you "save and change character encoding". Back in
Netscape (<=4) days we used to call it "Netscape Composter", but it's
much better in its Mozilla embodiment, AFAICS.
On Wed, 2 Mar 2005 14:14:42 +0000, Alan J. Flavell wrote:
>> On Wed, 2 Mar 2005, Lachlan Hunt wrote:
>>> Martin Trautmann wrote: it's not really UTF-8, but UTF-16. I don't have a major problem to make UTF-8 from UTF-16 - just stripping the other byte.
> That would - at best - make iso-8859-1, rather than utf-8. But horribly wrong if the stripped byte wasn't zero.
I was assuming a zero byte - whatever it is that is shown in vim as ^@.
> But it would be better to look at existing process and identify a point at which characters can be transparently converted as part of the process. XML processors can typically do this with appropriate parameters, for example, if an XML process is already involved.
We are talking about a 'stupid' data export to a text file which happens
to be UTF-16.
> Btw, for manual processes, Mozilla Composer seems to do quite a reasonable job when you "save and change character encoding". Back in Netscape (<=4) days we used to call it "Netscape Composter", but it's much better in its Mozilla embodiment, AFAICS.
Indeed - and that was my manual process first and was my reference for
comparison of the results.
I was surprised that it did not attempt to 'improve' anything that I
built myself (such as repairing "<a name=x>x</a>" -> "<a name=x></a>x")
It did fix e.g. </i><i> or <i></i> - and it did not create new
paragraphs on its own where the composter always exchanged 'random'
amounts of <p> and <br>.
Even better: it created validator approved HTML ;-)
However, I did not want to use Mozilla as my required UTF-16 convertor.
- Martin
In <11**********************@f14g2000cwb.googlegroups.com>, on
03/01/2005 at 11:07 AM, ph*******@treefic.com said:
> Anyway, if you really want to convert Unicode to latin1 + html
> character entities, I believe that GNU recode can do what you want:
What about bidi text?
--
Shmuel (Seymour J.) Metz, SysProg and JOAT <http://patriot.net/~shmuel>
Unsolicited bulk E-mail subject to legal action. I reserve the
right to publicly post or ridicule any abusive E-mail. Reply to
domain Patriot dot net user shmuel+news to contact me. Do not
reply to sp******@library.lspace.org
>> Anyway, if you really want to convert Unicode to latin1 + html character entities, I believe that GNU recode can do what you want:
> What about bidi text?
Good question. I believe that it generates one numeric character
entity for each unicode character without special consideration for
this. I think that this should just work, if the browser does the
right thing, shouldn't it?
--Phil.
On Wed, 2 Mar 2005, Shmuel (Seymour J.) Metz wrote:
> In <11**********************@f14g2000cwb.googlegroups.com>, on
> 03/01/2005 at 11:07 AM, ph*******@treefic.com said:
>> Anyway, if you really want to convert Unicode to latin1 + html character entities, I believe that GNU recode can do what you want:
> What about bidi text?
What about it? There's no reason that a mere change of character
encoding should have any effect on bidi properties. Arabic and logical
Hebrew should simply *work* (at least as well as they ever work in
your choice of browser).
See also http://www.nirdagan.com/hebrew/compare
So where do you foresee a problem?
Martin Trautmann wrote:
> On Wed, 02 Mar 2005 23:21:35 +1100, Lachlan Hunt wrote:
>> Stripping away one byte from a UTF-16 character does not necessarily produce the equivalent UTF-8 character. That would only apply to the US-ASCII subset of Unicode, for which the high-order octet is set to 0 and the low-order octet matches the US-ASCII character, but does not apply to any other character.
> True - but that's exactly the range where I wanted ASCII/Latin instead of UTF-16.
It won't work for the Latin range, only the US-ASCII range, as I said.
US-ASCII: Code positions 0 to 127 (decimal)
In UTF-8, these are encoded as single octets which are identical to US-ASCII
Latin: Code positions 128 to 255
In UTF-8, these (and everything above) are encoded using multiple
octets. I think all of this range is encoded as 2 octets, but they *do
not* match the UTF-16 encoding.
eg. Encoding the copyright symbol: © (decimal 169)
UTF-8: 0xC2 0xA9
UTF-16: 0x00 0xA9
ISO-8859-1: 0xA9 (this is a single-octet encoding)
That's why stripping the other byte will simply not work to convert
UTF-16 to UTF-8. Stripping the first byte also won't work for any
character above 255 (outside the ISO-8859-1 range).
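The byte values in the table above can be verified directly in Python (a quick check, mirroring the copyright-symbol example):

```python
# The copyright sign, U+00A9, in the three encodings discussed above:
c = "\u00a9"
print(c.encode("utf-8"))      # b'\xc2\xa9'  (two octets)
print(c.encode("utf-16-be"))  # b'\x00\xa9'  (two octets, high octet zero)
print(c.encode("latin-1"))    # b'\xa9'      (one octet)

# Dropping the zero octet from the UTF-16 form yields the Latin-1
# octet, not the UTF-8 sequence:
print(c.encode("utf-16-be")[1:] == c.encode("latin-1"))  # True
print(c.encode("utf-16-be")[1:] == c.encode("utf-8"))    # False
```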
--
Lachlan Hunt http://lachy.id.au/ http://GetFirefox.com/ Rediscover the Web http://GetThunderbird.com/ Reclaim your Inbox
On Thu, 03 Mar 2005 08:50:11 +1100, Lachlan Hunt wrote:
> It won't work for the latin range, only the US-ASCII range I said.
Thanks for your further explanation. I actually did not have any
critical problems within this range yet. But I felt it was the wrong
move, apart from being incomplete.
Martin