character to HTML ampersand escape sequence converter

Hello,
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?

thx.

Jul 23 '05 #1
SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


IIRC Tidy will do that.

http://tidy.sf.net/

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #2
* David Dorward wrote in comp.infosystems.www.authoring.html:
SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


IIRC Tidy will do that.


Well, yes, but only for character encodings it supports (and it does not
support any of the encodings SwordAngel listed to that extent). Windows
users can compile Tidy with an experimental feature that enables support
for all character encodings Windows / Internet Explorer support via the
TIDY_WIN32_MLANG_SUPPORT #define, but it is generally better to use
external tools such as iconv, piconv, uconv, recode, ... to convert the
document to UTF-8 and let Tidy process the document accordingly.
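
For readers who would rather script the transcoding step than call iconv, here is a minimal Python sketch of the same idea. It is only an illustration under assumptions: transcode_stream and its parameters are names invented for this example, and the encoding labels are Python codec names (euc_jp, big5, gb18030), not necessarily the labels iconv uses.

```python
import codecs
import io


def transcode_stream(src, dst, source_encoding, chunk_size=4096):
    """Copy bytes from src to dst, re-encoding them as UTF-8.

    src and dst are binary file-like objects. An incremental decoder is
    used so that a multi-byte character split across two chunks is still
    decoded correctly.
    """
    decoder = codecs.getincrementaldecoder(source_encoding)()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; this raises UnicodeDecodeError if the input
    # ended in the middle of a multi-byte sequence.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


def transcode_bytes(raw, source_encoding, chunk_size=4096):
    """Convenience wrapper for in-memory data."""
    out = io.BytesIO()
    transcode_stream(io.BytesIO(raw), out, source_encoding, chunk_size)
    return out.getvalue()
```

Note that a charset declaration inside the document (a meta element, for instance) is not touched by this and would have to be updated separately before the result is handed to Tidy.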
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #3
On Fri, 17 Dec 2004, SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


"free recode" ? http://recode.progiciels-bpi.ca/

Call it with something like:
recode -d euc-jp..h4 < input.html > output.html

That won't do anything to tidy up the HTML, though, unlike Tidy ;-)

And don't forget that when you've translated language-specific
encodings into Han-unified Unicode characters, you should mark-up
the source with the correct language attribute in order to get
the right rendering of the unified characters. At least that's my
understanding (I can't actually read them myself).
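
For comparison, roughly the same conversion can be sketched in Python with nothing but the standard codecs. This is not what recode does internally, just an illustration; to_numeric_references is a name invented here, and the source encoding has to be known in advance.

```python
def to_numeric_references(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes and return pure ASCII output.

    Every character outside ASCII becomes a decimal numeric character
    reference (&#NNNN;). ASCII characters, including markup such as
    '<', '>' and '&', pass through unchanged, so this works on the raw
    character stream without parsing the HTML.
    """
    text = raw.decode(source_encoding)
    return text.encode("ascii", errors="xmlcharrefreplace")


# Example: a paragraph whose bytes are EUC-JP encoded.
EXAMPLE_INPUT = "<p>日本語</p>".encode("euc_jp")
EXAMPLE_OUTPUT = to_numeric_references(EXAMPLE_INPUT, "euc_jp")
# EXAMPLE_OUTPUT == b"<p>&#26085;&#26412;&#35486;</p>"
```

The same call should work for Big5 and GB-18030 input by passing "big5" or "gb18030" as the codec name. Like recode, it does nothing to tidy up the markup or to add language attributes.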
Jul 23 '05 #4
In article <41****************@news.bjoern.hoehrmann.de>,
Bjoern Hoehrmann <de*******@gmx.net> writes:
IIRC Tidy will do that.

Indeed. I was on the point of suggesting AN XML processor until I saw
that (libxml2 accepts HTML as well as XML input).
Well, yes, but only for character encodings it supports (and it does not
support any of the encodings SwordAngel listed to that extent).
Indeed, libxml2 (last time I checked) supports some but not all of
those encodings, so the same limitation applies.

Have you considered tying in iconv to Tidy to improve i18n support?
but it is generally better to use
external tools such as iconv, piconv, uconv, recode, ... to convert the
document to UTF-8 and let Tidy process the document accordingly.


I believe OpenSP supports all the encodings named, though I'm
not entirely sure OTTOMH. So there may still be a one-stop
program for the conversion. But as Björn says, a transcoder
such as iconv is a more general solution.
--
Nick Kew

Nick's manifesto: http://www.htmlhelp.com/~nick/
Jul 23 '05 #5
* Nick Kew wrote in comp.infosystems.www.authoring.html:
Have you considered tying in iconv to Tidy to improve i18n support?


I wrote an experimental iconv wrapper which is included in the source
distribution, but it is not plugged into the code, i.e., you need to
change a few things in order to use it. Development of these features
was put on hold until a better interface for pluggable transcoders for
Tidy has been developed (which has not happened yet).
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #6
In article <fg************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting AN XML processor until I saw
that (libxml2 accepts HTML as well as XML input).


A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so? Is there an equivalent of SAX
filter or somesuch that would make HTML appear to the app as XHTML?

TagSoup on the Java side appears to the app as an XML parser parsing
XHTML.

Has anyone compared the tag slurping features of TagSoup and libxml2? I
wonder which one is a better idea when writing in Python: using libxml2
with CPython or using TagSoup with Jython?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #7
On Sat, 18 Dec 2004, Henri Sivonen wrote:
In article <fg************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting AN XML processor until I
saw that (libxml2 accepts HTML as well as XML input).


A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so?


But does this matter, in the context of the original question?

Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?
Jul 23 '05 #8
* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?


That does not work very well for comments, CDATA elements, processing
instructions, etc.
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #9
On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:
* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?
That does not work very well for comments,


Fortunately, HTML rendering agents don't need to interpret the content
of comments...
CDATA elements, processing instructions, etc.


Theoretically, of course, you are right; which is why I slipped-in
that qualification re. documents that are compatible with the WWW as
it exists.

I don't dispute that in theory you can produce counter-examples where
the simple method described above gives the wrong result, for the
reasons you gave; but I'm interested if a real-life example can be
produced where this would matter.

all the best
Jul 23 '05 #10
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Sat, 18 Dec 2004, Henri Sivonen wrote:
In article <fg************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting AN XML processor until I
saw that (libxml2 accepts HTML as well as XML input).
A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so?


But does this matter, in the context of the original question?


Perhaps not. It was a new question in the spirit of "discussion
forum--not help desk". :-)
Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?


Yes, except comments change if they exist and contain non-ASCII.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #11
* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
I don't dispute that in theory you can produce counter-examples where
the simple method described above gives the wrong result, for the
reasons you gave; but I'm interested if a real-life example can be
produced where this would matter.


Consider an HTML document with

<style type="text/css">
q:lang(no) { quotes: "«" "»" '"' '"' }
</style>

or consider HTML documents with scripts such as those in

http://www.rfs.jp/sitebuilder/javascript/01/08.html
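
To make the style example concrete: in HTML (as opposed to XHTML) the content of style and script elements is not scanned for character references, so a converter working purely on the character stream changes what the style sheet says. The sketch below uses Python's standard html.parser as a stand-in for a browser's tokenizer, which is an assumption; TextCollector and collect_text are names made up for this illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content seen inside each element name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []
        self.text = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:
            key = self.stack[-1]
            self.text[key] = self.text.get(key, "") + data


def collect_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.text


# What a naive stream-level converter would produce from the document:
NAIVE_OUTPUT = (
    '<style type="text/css">q:lang(no) { quotes: "&#171;" "&#187;" }</style>'
    "<p>&#171;hei&#187;</p>"
)
RESULT = collect_text(NAIVE_OUTPUT)
# RESULT["p"] holds the real guillemets, but RESULT["style"] still holds
# the literal text &#171; and &#187;, which a CSS parser would see as
# seven-character strings rather than as quotation marks.
```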
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #12
On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:
Consider an HTML document with

<style type="text/css">
q:lang(no) { quotes: "«" "»" '"' '"' }
</style>

or consider HTML documents with scripts such as those in

http://www.rfs.jp/sitebuilder/javascript/01/08.html


OK, I concede.

Of course, if the target encoding was meant to be us-ascii with
&#bignumber; representations of non-ascii characters (which might have
been what the questioner had in mind, since I understood the request to
be for &#bignumber; representation rather than actual utf-8-encoded
characters in the HTML part), then you'd need CSS-aware and
Javascript-aware converters to know how to represent those non-ascii
characters in their respective languages.
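
A rough sketch of what such language-aware escaping could look like, in Python. The function names are invented for this illustration, and the sketch assumes the caller has already isolated the style sheet text or the content of a JavaScript string literal; locating those spans in a real document is the hard part and is not shown.

```python
def css_escape_non_ascii(text: str) -> str:
    """Replace non-ASCII characters with CSS backslash escapes.

    Six hex digits are always emitted, so no terminating space is
    needed after the escape.
    """
    return "".join(
        ch if ord(ch) < 128 else "\\{:06X}".format(ord(ch)) for ch in text
    )


def js_escape_non_ascii(text: str) -> str:
    """Replace non-ASCII characters with JavaScript \\uXXXX escapes.

    Intended only for text inside a JavaScript string literal.
    Characters outside the Basic Multilingual Plane are written as a
    surrogate pair, which is how JavaScript strings represent them.
    """
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
            continue
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u{:04X}".format(int.from_bytes(units[i:i + 2], "big")))
    return "".join(out)


CSS_EXAMPLE = css_escape_non_ascii('q:lang(no) { quotes: "\u00ab" "\u00bb" }')
# CSS_EXAMPLE == 'q:lang(no) { quotes: "\\0000AB" "\\0000BB" }'
JS_EXAMPLE = js_escape_non_ascii('alert("\u65e5\u672c")')
# JS_EXAMPLE == 'alert("\\u65E5\\u672C")'
```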

Indeed the W3C were wise in their XHTML documentation to recommend
moving those enclosures out into separate files rather than trying to
in-line them as CDATA ;-)
Jul 23 '05 #13
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw
that (libxml2 accepts HTML as well as XML input).
A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so?


Yes, that's a reasonably fair summary. The HTML parser is the XML
parser with tolerance of non-XML and knowledge of HTML4.
Is there an equivalent of SAX
filter or somesuch that would make HTML appear to the app as XHTML?
The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction. HTML mode is also tolerant
of tag-soup, though not quite as forgiving as a typical browser.
There are a few bugs wrt the spec: most obviously, it only recognises
XML comment syntax (but then, so do the browsers).

As a corollary, you can use it to apply XML processing to HTML.
TagSoup on the Java side appears to the app as an XML parser parsing
XHTML.
I'm not familiar with that, but it's not uncommon.
Has anyone compared the tag slurping features of TagSoup and libxml2? I
wonder which one is a better idea when writing in Python: using libxml2
with CPython or using TagSoup with Jython?


Couldn't tell you. But I'd venture a strong guess that libxml2 will be
not only a great deal faster than anything-java, but also no harder
and possibly easier to work with.
--
Nick Kew
Jul 23 '05 #14
In article <fu************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw
that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction.
Are the elements in the XHTML namespace or in no namespace? The good
thing about TagSoup is that it allows the app internals to be written
for XHTML, so the same app internals work for HTML, XHTML *and*
XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
left on the parsing level and not carried over to higher levels as in
browsers.
But I'd venture a strong guess that libxml2 will be
not only a great deal faster than anything-java, but also no harder
and possibly easier to work with.


I think I read somewhere that the libxml2 wrapper gives the Python side
UTF-8 byte strings instead of Python Unicode strings.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #15
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
>> Indeed. I was on the point of suggesting AN XML processor until I saw
>> that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction.


Are the elements in the XHTML namespace or in no namespace?


They're not namespaced. At least not in the SAX parse mode, which is
where I've investigated the issue. At least, my preliminary experiments
trying to use the HTML parser in SAX2 mode were not successful, which
is not to say I won't return to the issue.
The good
thing about TagSoup is that it allows the app internals to be written
for XHTML, so the same app internals work for HTML, XHTML *and*
XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
left on the parsing level and not carried over to higher levels as in
browsers.


Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?
The full capability is at best a pathological edge-case.

BTW, if you're interested in namespace processing on the Web,
may I refer you to my recently-published article at
http://www.xml.com/pub/a/2004/12/15/...amespaces.html

--
Nick Kew
Jul 23 '05 #16
In article <cq***********@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:

>> Indeed. I was on the point of suggesting AN XML processor until I saw
>> that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction.


Are the elements in the XHTML namespace or in no namespace?


They're not namespaced.


That's a pity. Of course, it's possible to write a filter that takes
SAX1 events, adds the namespacing and emits SAX2 events, but it is
uncool to have to implement stuff that a library should be able to do
out of the box.
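
A minimal sketch of such a filter, written against Python's xml.sax interfaces rather than libxml2's own API (that substitution is an assumption made to keep the example self-contained). NamespaceAdder, Recorder and namespaced_events are names invented here. The filter receives non-namespaced startElement/endElement events, lowercases the names, and forwards namespace-aware events in the XHTML namespace.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl

XHTML_NS = "http://www.w3.org/1999/xhtml"


class NamespaceAdder(ContentHandler):
    """Turns SAX1-style events into namespace-aware (SAX2-style) events."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream

    def startDocument(self):
        self.downstream.startDocument()
        # Declare XHTML as the default namespace for the whole document.
        self.downstream.startPrefixMapping(None, XHTML_NS)

    def endDocument(self):
        self.downstream.endPrefixMapping(None)
        self.downstream.endDocument()

    def startElement(self, name, attrs):
        local = name.lower()
        # Attributes stay in no namespace, as in XHTML.
        values = {(None, key.lower()): value for key, value in attrs.items()}
        qnames = {key: key[1] for key in values}
        self.downstream.startElementNS(
            (XHTML_NS, local), local, AttributesNSImpl(values, qnames)
        )

    def endElement(self, name):
        local = name.lower()
        self.downstream.endElementNS((XHTML_NS, local), local)

    def characters(self, content):
        self.downstream.characters(content)


class Recorder(ContentHandler):
    """Example downstream handler that just records element events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElementNS(self, name, qname):
        self.events.append(("end", name))


def namespaced_events(document: bytes):
    # The default xml.sax parser has namespace processing switched off,
    # so it delivers plain startElement/endElement events to the filter.
    recorder = Recorder()
    xml.sax.parseString(document, NamespaceAdder(recorder))
    return recorder.events
```

The input here is well-formed XML standing in for the events an HTML parser would emit; with a real tag-soup parser in front, the filter itself would stay the same.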
The good
thing about TagSoup is that it allows the app internals to be written
for XHTML, so the same app internals work for HTML, XHTML *and*
XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
left on the parsing level and not carried over to higher levels as in
browsers.


Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?


The people who export from MS Office?

I was not suggesting that namespaces in HTML should be supported. How
that would work isn't even defined.

However, I think it doesn't make sense to write the app internals for
namespaceless HTML so that massive rework is needed for XHTML+FooML. It
makes more sense to write the app internals for namespaced compound
documents and to convert HTML to XHTML at parse time. Using an XML
parser is the right way to go for XHTML and XHTML+FooML.
BTW, if you're interested in namespace processing on the Web,
may I refer you to my recently-published article at
http://www.xml.com/pub/a/2004/12/15/...amespaces.html


Interesting.

BTW, how do you reconcile the GPL and the Apache license?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #17
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?
The people who export from MS Office?


Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?
I was not suggesting that namespaces in HTML should be supported. How
that would work isn't even defined.


It would presumably work by treating it as XHTML. Like XPath, XSLT,
etc, which do work fine with HTML and the libxml2 parser.
BTW, if you're interested in namespace processing on the Web,
may I refer you to my recently-published article at
http://www.xml.com/pub/a/2004/12/15/...amespaces.html


Interesting.

BTW, how do you reconcile the GPL and the Apache license?


Why is that a problem? My work is GPL (if you want it free - dual
licensing available otherwise). Apache is ASF license. They are
distributed separately. Those Linux distros (and FreeBSD) that
package my GPL modules offer them to users as separate packages,
and don't have a problem with it. Even the fundamentalists at
Debian don't have a problem with it. Any more than they have a
problem distributing non-GPL apps like Apache to run on Linux itself.

--
Nick Kew
Jul 23 '05 #18
In article <l8************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?
The people who export from MS Office?


Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?


I don't think so. It's more like HTML tag soup spiced up with colonified
names and XML "data islands".
I was not suggesting that namespaces in HTML should be supported. How
that would work isn't even defined.


It would presumably work by treating it as XHTML.


With namespaces in HTML I meant this kind of Microsoftism:

<HTML xmlns:k='urn:kewl-schema-urn'>
<HEAD>
<TITLE>Test</TITLE>
<xml>
<k:foo>
<k:bar/>
</k:foo>
</xml>
</HEAD>
<BODY>
....
</BODY>
</HTML>

(I suppose Microsoft has defined how that is supposed to work. So saying
it isn't defined was not entirely accurate.)
Why is that a problem?
The FSF lists the Apache licenses 1.0, 1.1 and 2.0 as GPL-incompatible
free software licenses.

http://www.fsf.org/licenses/license-...atibleLicenses
Even the fundamentalists at Debian don't have a problem with it.
That's surprising. :-)
Any more than they have a
problem distributing non-GPL apps like Apache to run on Linux itself.


IIRC, Linus Torvalds declared an exception when the subject came up.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #19
