an idiot question about a disallowed entity


Can't get this RSS feed clean:

http://www.whatisliberalism.com/pdsFiles/page2533.xml
Why is it dying?

Some users write posts in Microsoft Word, then copy the text, paste it
into the web browser, and hit submit to create a weblog entry. This is
what I just did myself.

I've written a PHP function that I thought would clean this feed. It
goes through the whole feed one byte at a time and makes sure every
byte has an ASCII value between 32 and 126. I thought that might leave
me some garbage characters, but they'd all be safe for RSS.

No. The feed is still dying. How do I find out what entity is killing
it?

Oct 12 '05 #1
9 Replies
lk******@geocities.com wrote:

: Can't get this RSS feed clean:

: http://www.whatisliberalism.com/pdsFiles/page2533.xml
: Why is it dying?

: Some users write posts in Microsoft Word, then copy and paste their
: post to the web browser and paste it in and hit submit and create a
: weblog entry. This is what I just did myself.

: I've written a PHP function that I thought would clean this feed, it
: goes through the whole feed one byte at a time, and makes sure every
: byte has an ascii value between 32 and 126. I thought that might give
: me some garbage characters but they'd all be safe for RSS.

: No. The feed is still dying. How do I find out what entity is killing
: it?

First I would feed it through an xml validator. It should tell you where
the xml goes wrong.

If it fails then you know what's wrong. If it passes - well, worry about
that after the first test.

--

This programmer available for rent.
Oct 12 '05 #2
Malcolm Dew-Jones (yf***@vtn1.victoria.tc.ca) wrote:
: lk******@geocities.com wrote:

: : Can't get this RSS feed clean:

: : http://www.whatisliberalism.com/pdsFiles/page2533.xml
: : Why is it dying?

: : Some users write posts in Microsoft Word, then copy and paste their
: : post to the web browser and paste it in and hit submit and create a
: : weblog entry. This is what I just did myself.

: : I've written a PHP function that I thought would clean this feed, it
: : goes through the whole feed one byte at a time, and makes sure every
: : byte has an ascii value between 32 and 126. I thought that might give
: : me some garbage characters but they'd all be safe for RSS.

: : No. The feed is still dying. How do I find out what entity is killing
: : it?

: First I would feed it through an xml validator. It should tell you where
: the xml goes wrong.

: If it fails then you know what's wrong. If it passes - well, worry about
: that after the first test.

In fact I realized I had a validator in "easy reach" so I used it on the
above url. I got

XML error: undefined entity, at line 22, column 23535

Using my handy dandy editor, I have cut and pasted some text from around
the offending section.

<description>I've ...

that our activities as feminists &acirc;'' including the
                                 ^^^^^^^
                                 ERROR

... of new ideas.</description>
You can see which entity is causing a problem. It fails on the first
error, so there could be other errors after that.
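The same kind of check can be run locally. The following is a minimal sketch in Python, assuming the standard library's expat bindings; the function name first_xml_error and the one-line sample are made up for illustration and are not the real feed. Like the validator above, it stops at the first error.

```python
import xml.parsers.expat

def first_xml_error(data: bytes):
    """Return (line, column, message) for the first well-formedness error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        # lineno is 1-based, offset is the 0-based column reported by expat
        return err.lineno, err.offset, xml.parsers.expat.ErrorString(err.code)
    return None

# Hypothetical stand-in for the feed, not the actual file.
sample = b"<rss><description>feminists &acirc; including</description></rss>"
print(first_xml_error(sample))
```

Because parsing stops at the first failure, the check has to be rerun after each fix until it returns None.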
--

This programmer available for rent.
Oct 12 '05 #3
> First I would feed it through an xml validator. It should tell you where
> the xml goes wrong.
> If it fails then you know what's wrong. If it passes - well, worry about
> that after the first test.


That was a very good idea. I got a very large number of errors. You can
see them if you go here:

http://www.stg.brown.edu/service/xmlvalid/

and type this address into the URI validation field:

http://www.whatisliberalism.com/pdsFiles/page2533.xml

I was left wondering what some of the errors meant. What does "error
(1103): end tag uses GI for an undeclared element: title" mean?

And what does " error (1012): reference to undeclared entity:
&acirc; " mean?

I'm confused by the last error. I don't know much about XML, but I
didn't think that an HTML entity reference was invalid in XML. Why
would it be? What's the easiest way to sanitize HTML entity references
so that XML won't choke on them?

Oct 12 '05 #4
lk******@geocities.com wrote:
> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?
>
> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML. Why
> would it be?

Because nobody defined them for the XML-based language that you use.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?

Define them.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 12 '05 #5
I don't know how to define entity references for XML, nor do I know
whether I'm allowed to add new definitions to RSS. XML is one of those
things I've been hoping to study for a while but have not yet had the
chance.

I'm wondering if there is a quick fix that will hold me till I have
time to look at the issue in depth. If I write a little PHP script to
strip out all HTML entity references, will the feed then work?

Oct 12 '05 #6
lk******@geocities.com wrote:
: I don't know how to define entity references for XML, nor am I aware if
: I'm allowed to add new definitions to RSS. XML is one of those things
: I've been hoping to study for awhile but have not yet had the chance.

: I'm wondering if there is a quick fix that will hold me till I have
: time to look at the issue in depth. If I write a little PHP script to
: strip out all HTML entity references, then the feed will work?

The quick fix for unrecognized entities is to escape them, so

&circ; should be escaped to become
&amp;circ;

The escaped data "&amp;circ;" will be unescaped back to the original
"&circ;" if an xml program extracts the data from the feed.

Whether the "&circ;" will _display_ correctly will depend on the program
that extracts and/or displays the data. I.e. if you use an xml program to
extract the description data into a file, and then use a browser to view
the file, then the browser will display the correct symbol. On the other
hand if the browser itself is reading the rss feed directly then it may or
may not display the desired symbol - it might display the word "&circ;"
instead.
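The escaping step described above can be sketched as follows. This is an illustration in Python rather than PHP, and the names escape_unknown_entities, XML_PREDEFINED and NAMED_REF are invented for the example. It assumes only named references of the form &name; need handling, and it leaves alone the five references that XML itself predefines.

```python
import re

# The five entities XML predefines; these must stay as they are.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Named references such as &acirc; (numeric ones like &#226; do not match).
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def escape_unknown_entities(text: str) -> str:
    """Turn &name; into &amp;name; unless name is predefined in XML."""
    def repl(match):
        if match.group(1) in XML_PREDEFINED:
            return match.group(0)
        # Replace only the leading ampersand, keep "name;" untouched.
        return "&amp;" + match.group(0)[1:]
    return NAMED_REF.sub(repl, text)

print(escape_unknown_entities("feminists &acirc; including R&amp;D"))
# feminists &amp;acirc; including R&amp;D
```

Running it twice gives the same result as running it once, since an already escaped "&amp;acirc;" contains no unknown reference. A bare "&" that is not part of a reference is not touched by this sketch and would still need escaping separately.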

As for the "GI" error, I am not familiar with that, and I'm sorry but I
haven't examined your file to figure it out.

--

This programmer available for rent.
Oct 13 '05 #7
lk******@geocities.com wrote:
>> First I would feed it through an xml validator. It should tell you where
>> the xml goes wrong.
>> If it fails then you know what's wrong. If it passes - well, worry about
>> that after the first test.
>
> That was a very good idea. I got a very large number of errors. You can
> see them if you go here:
>
> http://www.stg.brown.edu/service/xmlvalid/
>
> and type in this address to the URI validation field:
>
> http://www.whatisliberalism.com/pdsFiles/page2533.xml
>
> I was left wondering what some of the errors meant. What is " error
> (1103): end tag uses GI for an undeclared element: title " mean?

It means title was never declared in the DTD or Schema.

> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?

It means acirc was never declared in the DTD.

> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML.

It is if you haven't declared it (with the exception of the five
which are assumed to pre-exist, but only when *not* using a DTD).

> Why would it be?

Because that's what the rules say.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?

Convert them to actual characters (eg â for acirc) using the
declared character set of the document.
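A rough sketch of that conversion, in Python rather than PHP and assuming the feed is declared as UTF-8. The function name entities_to_chars is invented for the example; the lookup table is html.entities.name2codepoint from the standard library. The five references that XML predefines are deliberately left alone, because turning &amp; or &lt; into raw characters would break the markup again.

```python
import re
from html.entities import name2codepoint

# References XML already understands; converting these would corrupt the feed.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def entities_to_chars(text: str) -> str:
    """Replace HTML named references with the characters they stand for."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined and unknown names as-is
        return chr(name2codepoint[name])
    return NAMED_REF.sub(repl, text)

converted = entities_to_chars("caf&eacute; &acirc; R&amp;D")
print(converted)  # café â R&amp;D

# Write the result out in the encoding the feed declares (UTF-8 assumed here).
feed_bytes = converted.encode("utf-8")
```

Names that are not HTML entities are passed through unchanged in this sketch, so they would still trip the parser; they could be escaped afterwards as a fallback.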

///Peter
--
XML FAQ: http://xml.silmaril.ie/

Oct 13 '05 #8
lk******@geocities.com wrote:
> I don't know how to define entity references for XML, nor am I aware if
> I'm allowed to add new definitions to RSS. XML is one of those things
> I've been hoping to study for a while but have not yet had the chance.
>
> I'm wondering if there is a quick fix that will hold me till I have
> time to look at the issue in depth. If I write a little PHP script to
> strip out all HTML entity references, then the feed will work?


If you can change the feed, you could define the entities in a document
type declaration:

<!DOCTYPE rss [
<!ENTITY acirc "â">
]>
<rss>
....
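To check that a declaration like this is picked up, the document can be fed to a parser. Below is a small sketch using Python's xml.etree.ElementTree; the channel/description layout and the sample text are assumptions for illustration. The entity value is written as the numeric reference &#226; instead of a literal â so that the example does not depend on the file's encoding. Whether a given feed reader honours an internal DTD subset is a separate question and would need testing against the readers that matter.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed with acirc declared in the internal subset.
doc = """<!DOCTYPE rss [
<!ENTITY acirc "&#226;">
]>
<rss><channel><description>feminists &acirc; including</description></channel></rss>"""

root = ET.fromstring(doc)
# The parser expands &acirc; using the declaration above.
print(root.find("channel/description").text)
```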
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 13 '05 #9

Peter Flynn wrote:
> lk******@geocities.com wrote:
>> What's the easiest way to sanitize HTML entity references
>> so that XML won't choke on them?
>
> Convert them to actual characters (eg â for acirc) using the
> declared character set of the document.


I see. So if I say that the character encoding for the feed is UTF-8, I
look up what the equivalent of acirc is for UTF-8. That sounds like the
right long-term goal for me to aim for. Should be simple enough to look
up all the entity references on w3c and translate them all to UTF-8,
yes?

Oct 31 '05 #10
