Hello,
I'm looking for a program that converts characters of different
encodings (such as EUC-JP, Big5, GB-18030, etc.) into HTML ampersand
escape sequences. Anybody knows where I can find one?
thx.
Jul 23 '05
In article <Pi*******************************@ppepc56.ph.gla.ac.uk>,
"Alan J. Flavell" <fl*****@ph.gla.ac.uk> wrote:
On Sat, 18 Dec 2004, Henri Sivonen wrote:
In article <fg************@hugin.webthing.com>, ni**@hugin.webthing.com (Nick Kew) wrote:
Indeed. I was on the point of suggesting AN XML processor until I saw that (libxml2 accepts HTML as well as XML input). A quick glance at the API docs suggested that the HTML API is similar but separate from the XML API. Is it so?
But does this matter, in the context of the original question?
Perhaps not. It was a new question in the spirit of "discussion
forum--not help desk". :-)
Surely, given any WWW-compatible HTML or XHTML data stream, one can choose to convert any non-ascii coded character (or any selection of non-ascii characters) to a unicode code point and thence into &#bignumber; notation, purely at the character stream layer, without parsing the rest of the material at all?
Yes, except comments change if they exist and contain non-ASCII.
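To make the character-stream approach concrete, here is a minimal Python sketch. It assumes the source encoding is already known (the function name and sample string are made up for illustration) and relies on Python's built-in codecs and the standard `xmlcharrefreplace` error handler. It is one possible way to do this, not necessarily what anyone in the thread had in mind.

```python
def to_ascii_with_char_refs(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes, then replace every non-ASCII character with &#N;."""
    text = raw.decode(source_encoding)
    # 'xmlcharrefreplace' emits decimal numeric character references
    # for anything the target codec (ASCII here) cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical sample: an EUC-JP page fragment with a non-ASCII comment.
euc_jp_page = "<p>日本語</p><!-- 注 -->".encode("euc_jp")
converted = to_ascii_with_char_refs(euc_jp_page, "euc_jp")
print(converted.decode("ascii"))
```

The same call works with other codec names such as `big5` or `gb18030`. Because nothing is parsed, the text inside the comment is rewritten as well, which is the caveat mentioned above.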
--
Henri Sivonen hs******@iki.fi http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
On Sat, 18 Dec 2004, Bjoern Hoehrmann wrote: Consider a HTML document with
<style type="text/css"> q:lang(no) { quotes: "«" "»" '"' '"' } </style>
or consider HTML documents with scripts such as those in
http://www.rfs.jp/sitebuilder/javascript/01/08.html
OK, I concede.
Of course, if the target encoding was meant to be us-ascii with
&#bignumber; representations of non-ascii characters (which might have
been what the questioner had in mind, since I understood the request to
be for &#bignumber; representation rather than actual utf-8-encoded
characters in the HTML part), then you'd need CSS-aware and
Javascript-aware converters to know how to represent those non-ascii
characters in their respective languages.
Indeed the W3C were wise in their XHTML documentation to recommend
moving those enclosures out into separate files rather than trying to
in-line them as CDATA ;-)
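As a rough illustration of why the embedded languages need their own treatment: CSS and JavaScript each have their own escape notation, and &#bignumber; references are not interpreted inside HTML `<style>` or `<script>` content. The Python sketch below uses hypothetical helper names. It blindly escapes every non-ASCII character, which is only meaningful inside string literals and identifiers, so a real converter would still need to tokenize the CSS or JavaScript first.

```python
import re

_NON_ASCII = re.compile(r"[^\x00-\x7f]")


def css_escape_non_ascii(css: str) -> str:
    """Replace each non-ASCII character with a CSS escape.

    A CSS escape is a backslash, the code point in hex, and a
    terminating space that the CSS tokenizer consumes.
    """
    return _NON_ASCII.sub(lambda m: "\\%X " % ord(m.group()), css)


def js_escape_non_ascii(js: str) -> str:
    """Replace each non-ASCII character with JavaScript \\uXXXX escapes.

    Characters outside the BMP become two escapes (a surrogate pair),
    because \\uXXXX names a UTF-16 code unit.
    """
    def repl(match):
        units = match.group().encode("utf-16-be")
        return "".join(
            "\\u%04X" % int.from_bytes(units[i:i + 2], "big")
            for i in range(0, len(units), 2)
        )
    return _NON_ASCII.sub(repl, js)


print(css_escape_non_ascii('q:lang(no) { quotes: "«" "»" }'))
print(js_escape_non_ascii('alert("é")'))
```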
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw that (libxml2 accepts HTML as well as XML input). A quick glance at the API docs suggested that the HTML API is similar but separate from the XML API. Is it so?
Yes, that's a reasonably fair summary. The HTML parser is the XML
parser with tolerance of non-XML and knowledge of HTML4.
Is there an equivalent of SAX filter or somesuch that would make HTML appear to the app as XHTML?
The HTML parser gives you either SAX or DOM, and will process either
HTML or XHTML input without distinction. HTML mode is also tolerant
of tag-soup, though not quite as forgiving as a typical browser.
There are a few bugs wrt the spec: most obviously, it only recognises
XML comment syntax (but then, so do the browsers).
As a corollary, you can use it to apply XML processing to HTML.
TagSoup on the Java side appears to the app as an XML parser parsing XHTML.
I'm not familiar with that, but it's not uncommon.
Has anyone compared the tag slurping features of TagSoup and libxml2? I wonder which one is a better idea when writing in Python: using libxml2 with CPython or using TagSoup with Jython?
Couldn't tell you. But I'd venture a strong guess that libxml2 will be
not only a great deal faster than anything-java, but also no harder
and possibly easier to work with.
--
Nick Kew
In article <fu************@hugin.webthing.com>, ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>, Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either HTML or XHTML input without distinction.
Are the elements in the XHTML namespace or in no namespace? The good
thing about TagSoup is that it allows the app internals to be written
for XHTML, so the same app internals work for HTML, XHTML *and*
XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is
left on the parsing level and not carried over to higher levels as in
browsers.
But I'd venture a strong guess that libxml2 will be not only a great deal faster than anything-java, but also no harder and possibly easier to work with.
I think I read somewhere that the libxml2 wrapper gives the Python side
UTF-8 byte strings instead of Python Unicode strings.
--
Henri Sivonen hs******@iki.fi http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either HTML or XHTML input without distinction.
Are the elements in the XHTML namespace or in no namespace?
They're not namespaced. At least not in the SAX parse mode, which is
where I've investigated the issue. At least, my preliminary experiments
trying to use the HTML parser in SAX2 mode were not successful, which
is not to say I won't return to the issue.
The good thing about TagSoup is that it allows the app internals to be written for XHTML, so the same app internals work for HTML, XHTML *and* XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is left on the parsing level and not carried over to higher levels as in browsers.
Watch this space. That's what I'd like mod_publisher to do. OTOH,
how many people mix HTML (no X) with other namespaces in real life?
The full capability is at best a pathological edge-case.
BTW, if you're interested in namespace processing on the Web,
may I refer you to my recently-published article at http://www.xml.com/pub/a/2004/12/15/...amespaces.html
--
Nick Kew
In article <cq***********@hugin.webthing.com>, ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>, Henri Sivonen <hs******@iki.fi> writes:
Indeed. I was on the point of suggesting AN XML processor until I saw that (libxml2 accepts HTML as well as XML input).
The HTML parser gives you either SAX or DOM, and will process either HTML or XHTML input without distinction.
Are the elements in the XHTML namespace or in no namespace?
They're not namespaced.
That's a pity. Of course, it's possible to write a filter that takes
SAX1 events, adds the namespacing and emits SAX2 events, but it is
uncool to have to implement stuff that a library should be able to do
out of the box.
The good thing about TagSoup is that it allows the app internals to be written for XHTML, so the same app internals work for HTML, XHTML *and* XHTML+FooML (using an XML parser). That is, the HTML/XHTML difference is left on the parsing level and not carried over to higher levels as in browsers.
Watch this space. That's what I'd like mod_publisher to do. OTOH, how many people mix HTML (no X) with other namespaces in real life?
The people who export from MS Office?
I was not suggesting that namespaces in HTML should be supported. How
that would work isn't even defined.
However, I think it doesn't make sense to write the app internals for
namespaceless HTML so that massive rework is needed for XHTML+FooML. It
makes more sense to write the app internals for namespaced compound
documents and to convert HTML to XHTML at parse time. Using an XML
parser is the right way to go for XHTML and XHTML+FooML.
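For illustration, the kind of filter described above (take namespace-unaware SAX events, put every element in the XHTML namespace, emit namespace-aware events) might look roughly like this in Python's standard `xml.sax` module. This is a sketch under assumptions: the class names are hypothetical, the input in the usage example is well-formed XML rather than tag soup, and it has nothing to do with the actual libxml2 or TagSoup APIs.

```python
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesNSImpl

XHTML_NS = "http://www.w3.org/1999/xhtml"


class XhtmlNamespaceAdder(ContentHandler):
    """Receive namespace-unaware events and forward namespace-aware ones."""

    def __init__(self, downstream):
        super().__init__()
        self.downstream = downstream

    def startDocument(self):
        self.downstream.startDocument()
        # Declare XHTML as the default namespace for the whole document.
        self.downstream.startPrefixMapping(None, XHTML_NS)

    def endDocument(self):
        self.downstream.endPrefixMapping(None)
        self.downstream.endDocument()

    def startElement(self, name, attrs):
        name = name.lower()  # XHTML element names are lower case
        # Unprefixed attributes are in no namespace, hence the None.
        values = {(None, key.lower()): value for key, value in attrs.items()}
        qnames = {key: key[1] for key in values}
        self.downstream.startElementNS(
            (XHTML_NS, name), name, AttributesNSImpl(values, qnames)
        )

    def endElement(self, name):
        name = name.lower()
        self.downstream.endElementNS((XHTML_NS, name), name)

    def characters(self, content):
        self.downstream.characters(content)


class Recorder(ContentHandler):
    """Hypothetical namespace-aware application code: records events."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElementNS(self, name, qname, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElementNS(self, name, qname):
        self.events.append(("end", name))


recorder = Recorder()
# The default expat-based parser has namespace processing off,
# so it calls startElement/endElement on the filter.
xml.sax.parseString(
    b"<HTML><BODY CLASS='x'>hi</BODY></HTML>", XhtmlNamespaceAdder(recorder)
)
for event in recorder.events:
    print(event)
```

The downstream handler only ever sees namespaced XHTML elements, so the same application code could be fed from a namespace-aware XML parser for XHTML+FooML input.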
BTW, if you're interested in namespace processing on the Web, may I refer you to my recently-published article at http://www.xml.com/pub/a/2004/12/15/...amespaces.html
Interesting.
BTW, how do you reconcile the GPL and the Apache license?
--
Henri Sivonen hs******@iki.fi http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html
In article <hs****************************@news.dnainternet.net>,
Henri Sivonen <hs******@iki.fi> writes:
Watch this space. That's what I'd like mod_publisher to do. OTOH, how many people mix HTML (no X) with other namespaces in real life?
The people who export from MS Office?
Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?
I was not suggesting that namespaces in HTML should be supported. How that would work isn't even defined.
It would presumably work by treating it as XHTML. Like XPath, XSLT,
etc, which do work fine with HTML and the libxml2 parser.
BTW, if you're interested in namespace processing on the Web, may I refer you to my recently-published article at http://www.xml.com/pub/a/2004/12/15/...amespaces.html
Interesting.
BTW, how do you reconcile the GPL and the Apache license?
Why is that a problem? My work is GPL (if you want it free - dual
licensing available otherwise). Apache is ASF license. They are
distributed separately. Those Linux distros (and FreeBSD) that
package my GPL modules offer them to users as separate packages,
and don't have a problem with it. Even the fundamentalists at
Debian don't have a problem with it. Any more than they have a
problem distributing non-GPL apps like Apache to run on Linux itself.
--
Nick Kew
In article <l8************@hugin.webthing.com>, ni**@hugin.webthing.com (Nick Kew) wrote:
In article <hs****************************@news.dnainternet.net>, Henri Sivonen <hs******@iki.fi> writes:
Watch this space. That's what I'd like mod_publisher to do. OTOH, how many people mix HTML (no X) with other namespaces in real life?
The people who export from MS Office?
Good catch. I'd forgotten that one. Don't they try/claim to be XHTML?
I don't think so. It's more like HTML tag soup spiced up with colonified
names and XML "data islands".
I was not suggesting that namespaces in HTML should be supported. How that would work isn't even defined.
It would presumably work by treating it as XHTML.
With namespaces in HTML I meant this kind of Microsoftism:
<HTML xmlns:k='urn:kewl-schema-urn'>
<HEAD>
<TITLE>Test</TITLE>
<xml>
<k:foo>
<k:bar/>
</k:foo>
</xml>
</HEAD>
<BODY>
....
</BODY>
</HTML>
(I suppose Microsoft has defined how that is supposed to work. So saying
it isn't defined was not entirely accurate.)
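To show what a namespace-unaware tag-soup parser makes of such markup, here is a small sketch using Python's standard `html.parser` (chosen only because it ships with Python; it is neither libxml2 nor TagSoup). The sample is the snippet above with the body content omitted and the URN assumed to read `urn:kewl-schema-urn`; the class name is hypothetical.

```python
from html.parser import HTMLParser

SAMPLE = """<HTML xmlns:k='urn:kewl-schema-urn'>
<HEAD>
<TITLE>Test</TITLE>
<xml>
<k:foo>
<k:bar/>
</k:foo>
</xml>
</HEAD>
<BODY>
</BODY>
</HTML>"""


class TagCollector(HTMLParser):
    """Record (tag, attributes) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <k:bar/> also arrive here, because the
        # default handle_startendtag() calls handle_starttag().
        self.start_tags.append((tag, attrs))


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()

names = [tag for tag, _ in collector.start_tags]
print(names)
print(collector.start_tags[0])
```

The parser reports `k:foo` and `k:bar` as opaque tag names and `xmlns:k` as an ordinary attribute. No prefix is resolved to a namespace, so any namespace handling would have to be layered on top.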
Why is that a problem?
The FSF lists the Apache licenses 1.0, 1.1 and 2.0 as GPL-incompatible
free software licenses. http://www.fsf.org/licenses/license-...atibleLicenses
Even the fundamentalists at Debian don't have a problem with it.
That's surprising. :-)
Any more than they have a problem distributing non-GPL apps like Apache to run on Linux itself.
IIRC, Linus Torvalds declared an exception when the subject came up.
--
Henri Sivonen hs******@iki.fi http://iki.fi/hsivonen/
Mozilla Web Author FAQ: http://mozilla.org/docs/web-developer/faq.html