character to HTML ampersand escape sequence converter - Page 2

SwordAngel

Hello,
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Does anybody know where I can find one?

thx.

Jul 23 '05

Henri Sivonen

In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:

> On Sat, 18 Dec 2004, Henri Sivonen wrote:
>> In article <fg************@hugin.webthing.com>,
>> ni**@hugin.webthing.com (Nick Kew) wrote:
>>> Indeed. I was on the point of suggesting AN XML processor until I
>>> saw that (libxml2 accepts HTML as well as XML input).
>> A quick glance at the API docs suggested that the HTML API is similar
>> but separate from the XML API. Is it so?
>
> But does this matter, in the context of the original question?

Perhaps not. It was a new question in the spirit of "discussion
forum--not help desk". :-)

> Surely, given any WWW-compatible HTML or XHTML data stream, one can
> choose to convert any non-ascii coded character (or any selection of
> non-ascii characters) to a unicode code point and thence into
> &#bignumber; notation, purely at the character stream layer, without
> parsing the rest of the material at all?

Yes, except comments change if they exist and contain non-ASCII.
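
The character-stream approach in the quoted question can be sketched in a few lines of Python. This is only a minimal sketch, not a tool anyone in the thread named: the function name `to_char_refs` and the sample string are made up for illustration, and it relies on Python's built-in codecs (`euc_jp`, `big5`, `gb18030`) plus the standard `xmlcharrefreplace` error handler.

```python
def to_char_refs(raw, encoding):
    """Decode bytes in a legacy encoding and return ASCII-only text in which
    every non-ASCII character is a decimal &#N; character reference."""
    text = raw.decode(encoding)
    # 'xmlcharrefreplace' swaps each unencodable character for &#N;
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Hypothetical sample input: a Japanese paragraph stored as EUC-JP bytes.
sample = "<p>日本語</p>".encode("euc_jp")
print(to_char_refs(sample, "euc_jp"))
```

Because it never looks at the markup, this also rewrites non-ASCII characters inside comments, which is the caveat above.
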

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Jul 23 '05 #11

Bjoern Hoehrmann

* Alan J. Flavell wrote in comp.infosystems.www.authoring.html:

> I don't dispute that in theory you can produce counter-examples where
> the simple method described above gives the wrong result, for the
> reasons you gave; but I'm interested if a real-life example can be
> produced where this would matter.

Consider a HTML document with

<style type="text/css">
q:lang(no) { quotes: "«" "»" '"' '"' }
</style>

or consider HTML documents with scripts such as those in

http://www.rfs.jp/sitebuilder/javascript/01/08.html
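
Why these cases break the simple method: in HTML 4 the content of `style` and `script` elements is CDATA, so character references inside them are not decoded. A rough demonstration using Python's standard `html.parser` module, which treats those two elements the same way (the class name `TextByElement` and the sample markup are made up here):

```python
from html.parser import HTMLParser


class TextByElement(HTMLParser):
    """Collect the text seen inside each element, keyed by tag name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.current = None
        self.text = {}

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.text[self.current] = self.text.get(self.current, "") + data


doc = '<style>q { quotes: "&#171;" "&#187;" }</style><p>&#171;quoted&#187;</p>'
parser = TextByElement()
parser.feed(doc)
parser.close()
print(parser.text["style"])  # the references are still literal text here
print(parser.text["p"])      # here they have been decoded
```
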
--
Björn Höhrmann · mailto:bj****@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Jul 23 '05 #12

Alan J. Flavell

On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote:

> Consider a HTML document with
>
> <style type="text/css">
> q:lang(no) { quotes: "«" "»" '"' '"' }
> </style>
>
> or consider HTML documents with scripts such as those in
>
> http://www.rfs.jp/sitebuilder/javascript/01/08.html

OK, I concede.

Of course, if the target encoding was meant to be us-ascii with
&#bignumber; representations of non-ascii characters (which might have
been what the questioner had in mind, since I understood the request to
be for &#bignumber; representation rather than actual utf-8-encoded
characters in the HTML part), then you'd need CSS-aware and
Javascript-aware converters to know how to represent those non-ascii
characters in their respective languages.
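
To make "their respective languages" concrete, here is a small Python sketch of the three ASCII-only notations involved. The function names are invented for this example, and it does not find the CSS or script sections of a document; it only shows what a converter would have to emit once it knows which language it is in.

```python
def html_char_ref(ch):
    """HTML/XHTML text and attribute values: decimal character reference."""
    return "&#%d;" % ord(ch)


def css_escape(ch):
    """CSS: backslash, hex code point, then a space that ends the escape."""
    return "\\%X " % ord(ch)


def js_escape(ch):
    """JavaScript string literal: one \\uXXXX escape per UTF-16 code unit."""
    units = ch.encode("utf-16-be")
    return "".join(
        "\\u%04X" % int.from_bytes(units[i:i + 2], "big")
        for i in range(0, len(units), 2)
    )


def escape_non_ascii(text, escape_one):
    """Apply escape_one to every non-ASCII character, leave ASCII alone."""
    return "".join(c if ord(c) < 128 else escape_one(c) for c in text)


print(escape_non_ascii('<q lang="no">«hei»</q>', html_char_ref))
print(escape_non_ascii('q:lang(no) { quotes: "«" "»" }', css_escape))
print(escape_non_ascii('alert("«hei»");', js_escape))
```
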

Indeed the W3C were wise in their XHTML documentation to recommend
moving those enclosures out into separate files rather than trying to
in-line them as CDATA ;-)

Jul 23 '05 #13

Nick Kew

In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:

>> Indeed. I was on the point of suggesting AN XML processor until I saw
>> that (libxml2 accepts HTML as well as XML input).
>
> A quick glance at the API docs suggested that the HTML API is similar
> but separate from the XML API. Is it so?

Yes, that's a reasonably fair summary. The HTML parser is the XML
parser with tolerance of non-XML and knowledge of HTML4.

> Is there an equivalent of SAX
> filter or somesuch that would make HTML appear to the app as XHTML?

The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction. HTML mode is also tolerant
of tag-soup, though not quite as forgiving as a typical browser.
There are a few bugs wrt the spec: most obviously, it only recognises
XML comment syntax (but then, so do the browsers).

As a corollary, you can use it to apply XML processing to HTML.

> TagSoup on the Java side appears to the app as an XML parser parsing
> XHTML.

I'm not familiar with that, but it's not uncommon.

> Has anyone compared the tag slurping features of TagSoup and libxml2? I
> wonder which one is a better idea when writing in Python: using libxml2
> with CPython or using TagSoup with Jython?

Couldn't tell you. But I'd venture a strong guess that libxml2 will be
not only a great deal faster than anything-java, but also no harder
and possibly easier to work with.
--
Nick Kew

Jul 23 '05 #14

Henri Sivonen

In article <fu************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:

> In article <hs****************************@news.dnainternet.net>,
> Henri Sivonen <hs******@iki.fi> writes:
>>> Indeed. I was on the point of suggesting AN XML processor until I saw
>>> that (libxml2 accepts HTML as well as XML input).
>
> The HTML parser gives you either SAX or DOM, and will process either
> HTML or XHTML input without distinction.

Are the elements in the XHTML namespace or in no namespace? The good
thing about TagSoup is that it allows the app internals to be written
for XHTML, so the same app internals work for HTML, XHTML *and*
XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
left on the parsing level and not carried over to higher levels as in
browsers.

> But I'd venture a strong guess that libxml2 will be
> not only a great deal faster than anything-java, but also no harder
> and possibly easier to work with.

I think I read somewhere that the libxml2 wrapper gives the Python side
UTF-8 byte strings instead of Python Unicode strings.
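
If a wrapper does hand back UTF-8 byte strings, the usual remedy is to decode once at the boundary. A tiny Python sketch of that idea; `as_text` is a made-up helper name, and nothing here calls libxml2 itself:

```python
def as_text(value):
    """Return a Unicode string whether value is UTF-8 bytes or already text."""
    if isinstance(value, bytes):
        return value.decode("utf-8")
    return value


raw = "Björn".encode("utf-8")     # what a byte-oriented API would return
print(len(raw), len(as_text(raw)))  # byte count versus character count
```
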

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Jul 23 '05 #15

Nick Kew

In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:

>> In article <hs****************************@news.dnainternet.net>,
>> Henri Sivonen <hs******@iki.fi> writes:
>>>> Indeed. I was on the point of suggesting AN XML processor until I saw
>>>> that (libxml2 accepts HTML as well as XML input).
>> The HTML parser gives you either SAX or DOM, and will process either
>> HTML or XHTML input without distinction.
>
> Are the elements in the XHTML namespace or in no namespace?

They're not namespaced. At least not in the SAX parse mode, which is
where I've investigated the issue. At least, my preliminary experiments
trying to use the HTML parser in SAX2 mode were not successful, which
is not to say I won't return to the issue.

> The good
> thing about TagSoup is that it allows the app internals to be written
> for XHTML, so the same app internals work for HTML, XHTML *and*
> XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
> left on the parsing level and not carried over to higher levels as in
> browsers.

Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?
The full capability is at best a pathological edge-case.

BTW, if you're interested in namespace processing on the Web,
may I refer you to my recently-published article at
http://www.xml.com/pub/a/2004/12/15/...amespaces.html

--
Nick Kew

Jul 23 '05 #16

Henri Sivonen

In article <cq***********@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:

> In article <hs****************************@news.dnainternet.net>,
> Henri Sivonen <hs******@iki.fi> writes:
>>> In article <hs****************************@news.dnainternet.net>,
>>> Henri Sivonen <hs******@iki.fi> writes:
>>>>> Indeed. I was on the point of suggesting AN XML processor until I saw
>>>>> that (libxml2 accepts HTML as well as XML input).
>>> The HTML parser gives you either SAX or DOM, and will process either
>>> HTML or XHTML input without distinction.
>>
>> Are the elements in the XHTML namespace or in no namespace?
>
> They're not namespaced.

That's a pity. Of course, it's possible to write a filter that takes
SAX1 events, adds the namespacing and emits SAX2 events, but it is
uncool to have to implement stuff that a library should be able to do
out of the box.
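
Such a filter is not much code. The sketch below uses Python's standard `xml.sax` package, where `startElement` plays the role of the SAX1-style event and `startElementNS` the namespace-aware SAX2-style one. Everything named here (`NamespacingFilter`, `add_xhtml_namespace`) is invented for the illustration, and the events come from expat on well-formed input rather than from libxml2's HTML parser, so treat it as a shape to adapt, not as libxml2's API.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesNSImpl

XHTML_NS = "http://www.w3.org/1999/xhtml"


class NamespacingFilter(ContentHandler):
    """Take non-namespaced element events and re-emit them downstream
    as namespace-aware events in the XHTML namespace."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream

    def startDocument(self):
        self.downstream.startDocument()
        # Declare XHTML as the default namespace (no prefix).
        self.downstream.startPrefixMapping(None, XHTML_NS)

    def endDocument(self):
        self.downstream.endPrefixMapping(None)
        self.downstream.endDocument()

    def startElement(self, name, attrs):
        local = name.lower()  # XHTML element names are lowercase
        values = {(None, k.lower()): v for k, v in attrs.items()}
        qnames = {key: key[1] for key in values}
        self.downstream.startElementNS(
            (XHTML_NS, local), local, AttributesNSImpl(values, qnames))

    def endElement(self, name):
        local = name.lower()
        self.downstream.endElementNS((XHTML_NS, local), local)

    def characters(self, content):
        self.downstream.characters(content)


def add_xhtml_namespace(markup):
    """Run well-formed, non-namespaced markup through the filter."""
    out = io.StringIO()
    handler = NamespacingFilter(XMLGenerator(out, encoding="utf-8"))
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return out.getvalue()


result = add_xhtml_namespace('<HTML><BODY><P CLASS="x">hi</P></BODY></HTML>')
print(result)
```
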

>> The good
>> thing about TagSoup is that it allows the app internals to be written
>> for XHTML, so the same app internals work for HTML, XHTML *and*
>> XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
>> left on the parsing level and not carried over to higher levels as in
>> browsers.
>
> Watch this space. That's what I'd like mod_publisher to do. OTOH,
> how many people mix HTML (no X) with other namespaces in real life?

The people who export from MS Office?

I was not suggesting that namespaces in HTML should be supported. How
that would work isn't even defined.

However, I think it doesn't make sense to write the app internals for
namespaceless HTML so that massive rework is needed for XHTML+FooML. It
makes more sense to write the app internals for namespaced compound
documents and to convert HTML to XHTML at parse time. Using an XML
parser is the right way to go for XHTML and XHTML+FooML.
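
One way to read "convert HTML to XHTML at parse time", sketched with Python's standard library rather than TagSoup or libxml2. The class `XhtmlTreeParser`, its `finish` method and the `VOID` set are invented for this illustration, and the error recovery is far cruder than in either of those tools: it only closes whatever is still open.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Elements that never have content or an end tag in HTML 4.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


def qualified(tag):
    """Return the ElementTree name of an HTML tag in the XHTML namespace."""
    return "{%s}%s" % (XHTML_NS, tag)


class XhtmlTreeParser(HTMLParser):
    """Parse HTML and build an element tree with namespaced names."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser has already lowercased tag and attribute names.
        attributes = {k: (k if v is None else v) for k, v in attrs}
        self.builder.start(qualified(tag), attributes)
        if tag in VOID:
            self.builder.end(qualified(tag))
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, or end tag of a void element
        while True:  # close everything opened since the matching start tag
            top = self.open_tags.pop()
            self.builder.end(qualified(top))
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:
            self.builder.data(data)

    def finish(self):
        """Flush the parser, close any open elements, return the root."""
        self.close()
        while self.open_tags:
            self.builder.end(qualified(self.open_tags.pop()))
        return self.builder.close()


parser = XhtmlTreeParser()
parser.feed("<HTML><BODY><P CLASS=a>one<BR>two</P><P>three</BODY>")
root = parser.finish()
paragraphs = root.findall(".//{%s}p" % XHTML_NS)
print(root.tag, len(paragraphs))
```

Code after the parser only ever sees namespaced element names, so the same lookups would work on a tree produced by a real XML parser from XHTML input.
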

> BTW, if you're interested in namespace processing on the Web,
> may I refer you to my recently-published article at
> http://www.xml.com/pub/a/2004/12/15/...amespaces.html

Interesting.

BTW, how do you reconcile the GPL and the Apache license?

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Jul 23 '05 #17

Nick Kew

In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:

>> Watch this space. That's what I'd like mod_publisher to do. OTOH,
>> how many people mix HTML (no X) with other namespaces in real life?
>
> The people who export from MS Office?

Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?

> I was not suggesting that namespaces in HTML should be supported. How
> that would work isn't even defined.

It would presumably work by treating it as XHTML. Like XPath, XSLT,
etc, which do work fine with HTML and the libxml2 parser.

>> BTW, if you're interested in namespace processing on the Web,
>> may I refer you to my recently-published article at
>> http://www.xml.com/pub/a/2004/12/15/...amespaces.html
>
> Interesting.
>
> BTW, how do you reconcile the GPL and the Apache license?

Why is that a problem? My work is GPL (if you want it free - dual
licensing available otherwise). Apache is ASF license. They are
distributed separately. Those Linux distros (and FreeBSD) that
package my GPL modules offer them to users as separate packages,
and don't have a problem with it. Even the fundamentalists at
Debian don't have a problem with it. Any more than they have a
problem distributing non-GPL apps like Apache to run on Linux itself.

--
Nick Kew

Jul 23 '05 #18

Henri Sivonen

In article <l8************@hugin.webthing.com>,
ni**@hugin.webthing.com (Nick Kew) wrote:

> In article <hs****************************@news.dnainternet.net>,
> Henri Sivonen <hs******@iki.fi> writes:
>>> Watch this space. That's what I'd like mod_publisher to do. OTOH,
>>> how many people mix HTML (no X) with other namespaces in real life?
>> The people who export from MS Office?
>
> Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?

I don't think so. It's more like HTML tag soup spiced up with colonified
names and XML "data islands".

>> I was not suggesting that namespaces in HTML should be supported. How
>> that would work isn't even defined.
>
> It would presumably work by treating it as XHTML.

With namespaces in HTML I meant this kind of Microsoftism:

<HTML xmlns:k='urn:kewl-schema-urn'>
<HEAD>
<TITLE>Test</TITLE>
<xml>
<k:foo>
<k:bar/>
</k:foo>
</xml>
</HEAD>
<BODY>
....
</BODY>
</HTML>

(I suppose Microsoft has defined how that is supposed to work. So saying
it isn't defined was not entirely accurate.)

> Why is that a problem?

The FSF lists the Apache licenses 1.0, 1.1 and 2.0 as GPL-incompatible
free software licenses.

http://www.fsf.org/licenses/license-...atibleLicenses

> Even the fundamentalists at Debian don't have a problem with it.

That's surprising. :-)

> Any more than they have a
> problem distributing non-GPL apps like Apache to run on Linux itself.

IIRC, Linus Torvalds declared an exception when the subject came up.

--
Henri Sivonen
hs******@iki.fi
http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html

Jul 23 '05 #19
