character to HTML ampersand escape sequence converter

Hello,
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?

thx.

Jul 23 '05 #1
SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


IIRC Tidy will do that.

http://tidy.sf.net/

--
David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
Home is where the ~/.bashrc is
Jul 23 '05 #2
* David Dorward wrote in comp.infosystems.www.authoring.html:
SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


IIRC Tidy will do that.


Well, yes, but only for character encodings it supports (and it does not
support any of the encodings SwordAngel listed to that extent). Windows
users can compile Tidy with an experimental feature that enables support
for all character encodings Windows / Internet Explorer support via the
TIDY_WIN32_MLANG_SUPPORT #define, but it is generally better to use
external tools such as iconv, piconv, uconv, recode, ... to convert the
document to UTF-8 and let Tidy process the document accordingly.
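
A minimal Python sketch of the transcoding step described above, assuming the source encoding is known in advance. The helper name `transcode_to_utf8` is made up for illustration, and Python's built-in codecs stand in for iconv here; this is not how Tidy or iconv work internally.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    # The default error handling is strict: malformed input raises
    # UnicodeDecodeError instead of being silently dropped or replaced.
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample input: an HTML fragment encoded as EUC-JP.
euc_jp_bytes = "<p>日本語</p>".encode("euc_jp")
utf8_bytes = transcode_to_utf8(euc_jp_bytes, "euc_jp")
```

The same call should work with the codec names "big5" and "gb18030"; the UTF-8 result can then be handed to Tidy as UTF-8 input.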
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #3
On Fri, 17 Dec 2004, SwordAngel wrote:
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?


"free recode" ? http://recode.progiciels-bpi.ca/

Call it with something like:
recode -d euc-jp..h4 < input.html > output.html

That won't do anything to tidy up the HTML, though, unlike Tidy ;-)

And don't forget that when you've translated language-specific
encodings into Han-unified Unicode characters, you should mark-up
the source with the correct language attribute in order to get
the right rendering of the unified characters. At least that's my
understanding (I can't actually read them myself).
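
For anyone without recode at hand, here is a rough Python sketch of the two points above: turning non-ASCII characters into numeric character references, then marking the result with a language attribute. The function names and the sample fragment are invented for illustration. It only emits decimal references, so its output need not match what recode produces.

```python
def to_numeric_references(raw: bytes, source_encoding: str) -> str:
    """Decode legacy-encoded bytes and escape every non-ASCII character."""
    text = raw.decode(source_encoding)
    # "xmlcharrefreplace" rewrites each character that ASCII cannot
    # represent as &#N; (decimal code point) and leaves ASCII untouched,
    # so existing markup such as < and & is not altered.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def wrap_with_lang(fragment: str, lang: str) -> str:
    """Wrap a fragment in a div carrying the given language code."""
    return f'<div lang="{lang}">{fragment}</div>'


# Hypothetical sample input: an HTML fragment encoded as EUC-JP.
source = "<p>日本語</p>".encode("euc_jp")
print(wrap_with_lang(to_numeric_references(source, "euc_jp"), "ja"))
# prints <div lang="ja"><p>&#26085;&#26412;&#35486;</p></div>
```

The language code ("ja" here) has to come from knowledge of the source document; it cannot be recovered from the unified code points afterwards.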
Jul 23 '05 #4
In article <41****************@news.bjoern.hoehrmann.de>,
Bjoern Hoehrmann <de*******@gmx.net> writes:
IIRC Tidy will do that.

Indeed. I was on the point of suggesting an XML processor until I saw
that (libxml2 accepts HTML as well as XML input).
Well, yes, but only for character encodings it supports (and it does not
support any of the encodings SwordAngel listed to that extent).
Indeed, libxml2 (last time I checked) supports some but not all of
those encodings, so the same limitation applies.

Have you considered tying in iconv to Tidy to improve i18n support?
but it is generally better to use
external tools such as iconv, piconv, uconv, recode, ... to convert the
document to UTF-8 and let Tidy process the document accordingly.


I believe OpenSP supports all the encodings named, though I'm
not entirely sure OTTOMH. So there may still be a one-stop
program for the conversion. But as Björn says, a transcoder
such as iconv is a more general solution.
--
Nick Kew

Nick's manifesto: http://www.htmlhelp.com/~nick/
Jul 23 '05 #5
* Nick Kew wrote in comp.infosystems.www.authoring.html:
Have you considered tying in iconv to Tidy to improve i18n support?


I wrote an experimental iconv wrapper which is included in the source
distribution, but it is not plugged into the code, i.e., you need to
change a few things in order to use it. Development of these features
was put on hold until a better interface for pluggable transcoders for
Tidy has been developed (which has not happened yet).
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #6
In article <fg************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting an XML processor until I saw
that (libxml2 accepts HTML as well as XML input).


A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so? Is there an equivalent of SAX
filter or somesuch that would make HTML appear to the app as XHTML?

TagSoup on the Java side appears to the app as an XML parser parsing
XHTML.

Has anyone compared the tag slurping features of TagSoup and libxml2? I
wonder which one is a better idea when writing in Python: using libxml2
with CPython or using TagSoup with Jython?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
Jul 23 '05 #7
On Sat, 18 Dec 2004, Henri Sivonen wrote:
In article <fg************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting an XML processor until I
saw that (libxml2 accepts HTML as well as XML input).


A quick glance at the API docs suggested that the HTML API is similar
but separate from the XML API. Is it so?


But does this matter, in the context of the original question?

Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?
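
A rough Python sketch of that stream-level idea, under the assumption that the input encoding is known. The function `escape_stream` and its parameters are hypothetical names; the `should_escape` predicate covers the "any selection of non-ascii characters" case. Nothing here parses the markup.

```python
import io


def escape_stream(src, dst, encoding,
                  should_escape=lambda ch: ord(ch) > 127,
                  chunk_size=4096):
    """Copy a byte stream to a text stream, escaping selected characters.

    src is a binary file-like object, dst a text file-like object.
    """
    # TextIOWrapper decodes incrementally, so a multi-byte character that
    # straddles two reads is still decoded correctly. newline="" keeps
    # line endings exactly as they appear in the input.
    reader = io.TextIOWrapper(src, encoding=encoding, newline="")
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        dst.write("".join(
            f"&#{ord(ch)};" if should_escape(ch) else ch for ch in chunk
        ))
    # Detach so that src is not closed when the wrapper is discarded.
    reader.detach()


# Hypothetical sample input: an HTML fragment encoded as Big5.
src = io.BytesIO("<p>中文</p>".encode("big5"))
dst = io.StringIO()
escape_stream(src, dst, "big5")
print(dst.getvalue())
# prints <p>&#20013;&#25991;</p>
```

Passing a narrower predicate, for example one that escapes only characters outside Latin-1, leaves the remaining characters for the output encoding to carry.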
Jul 23 '05 #8
* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?


That does not work very well for comments, CDATA elements, processing
instructions, etc.
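
A small Python illustration of the CDATA case, using a made-up document and a deliberately naive converter (the name `naive_escape` is hypothetical). In HTML 4 the content of script and style elements is CDATA, so character references inside it are not expanded.

```python
def naive_escape(text: str) -> str:
    """Escape every non-ASCII character, ignoring document structure."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical document with the same character in a paragraph and a script.
document = '<p>あ</p><script>alert("あ");</script>'
escaped = naive_escape(document)
print(escaped)
# prints <p>&#12354;</p><script>alert("&#12354;");</script>
```

The paragraph still renders the same character, but the script now contains the literal text `&#12354;` in its string, because the script engine receives the element content without reference expansion. A structure-aware tool would need a script-level escape such as `\u3042` there instead.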
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
Jul 23 '05 #9
On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:
* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:
Surely, given any WWW-compatible HTML or XHTML data stream, one can
choose to convert any non-ascii coded character (or any selection of
non-ascii characters) to a unicode code point and thence into
&#bignumber; notation, purely at the character stream layer, without
parsing the rest of the material at all?
That does not work very well for comments,


Fortunately, HTML rendering agents don't need to interpret the content
of comments...
CDATA elements, processing instructions, etc.


Theoretically, of course, you are right; which is why I slipped-in
that qualification re. documents that are compatible with the WWW as
it exists.

I don't dispute that in theory you can produce counter-examples where
the simple method described above gives the wrong result, for the
reasons you gave; but I'm interested if a real-life example can be
produced where this would matter.

all the best
Jul 23 '05 #10
