an idiot question about a disallowed entity


Can't get this RSS feed clean:

http://www.whatisliberalism.com/pdsFiles/page2533.xml
Why is it dying?

Some users write posts in Microsoft Word, then copy the text, paste it
into the web browser, and hit submit to create a weblog entry. This is
what I just did myself.

I've written a PHP function that I thought would clean this feed. It
goes through the whole feed one byte at a time and makes sure every
byte has an ASCII value between 32 and 126. I thought that might give
me some garbage characters, but they'd all be safe for RSS.

No. The feed is still dying. How do I find out what entity is killing
it?

Oct 12 '05 #1
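
The PHP function itself is not shown in the thread, so the following is only a hypothetical Python stand-in for the byte filter described above (the name keep_printable_ascii and the sample bytes are assumptions). It suggests why such a filter cannot catch the problem: an entity reference like &acirc; consists entirely of printable ASCII bytes, so it passes through untouched.

```python
def keep_printable_ascii(data: bytes) -> bytes:
    """Drop every byte outside the printable ASCII range 32..126."""
    return bytes(b for b in data if 32 <= b <= 126)


# Hypothetical sample: an entity reference followed by two stray high bytes.
sample = b"feminists &acirc;\x80\x94 including"
cleaned = keep_printable_ascii(sample)

# The high bytes are removed, but "&acirc;" survives because it is pure ASCII.
print(cleaned)
```

Note that a filter like this also removes tabs and newlines (bytes 9, 10, and 13), which are legal in XML.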
lk******@geocities.com wrote:

: Can't get this RSS feed clean:

: http://www.whatisliberalism.com/pdsFiles/page2533.xml
: Why is it dying?

: Some users write posts in Microsoft Word, then copy and paste their
: post to the web browser and paste it in and hit submit and create a
: weblog entry. This is what I just did myself.

: I've written a PHP function that I thought would clean this feed, it
: goes through the whole feed one byte at a time, and makes sure every
: byte has an ascii value between 32 and 126. I thought that might give
: me some garbage characters but they'd all be safe for RSS.

: No. The feed is still dying. How do I find out what entity is killing
: it?

First I would feed it through an xml validator. It should tell you where
the xml goes wrong.

If it fails, then you know what's wrong. If it passes, well, worry about
that after the first test.

--

This programmer available for rent.
Oct 12 '05 #2
Malcolm Dew-Jones (yf***@vtn1.victoria.tc.ca) wrote:
: lk******@geocities.com wrote:

: : Can't get this RSS feed clean:

: : http://www.whatisliberalism.com/pdsFiles/page2533.xml
: : Why is it dying?

: : Some users write posts in Microsoft Word, then copy and paste their
: : post to the web browser and paste it in and hit submit and create a
: : weblog entry. This is what I just did myself.

: : I've written a PHP function that I thought would clean this feed, it
: : goes through the whole feed one byte at a time, and makes sure every
: : byte has an ascii value between 32 and 126. I thought that might give
: : me some garbage characters but they'd all be safe for RSS.

: : No. The feed is still dying. How do I find out what entity is killing
: : it?

: First I would feed it through an xml validator. It should tell you where
: the xml goes wrong.

: If it fails, then you know what's wrong. If it passes, well, worry about
: that after the first test.

In fact I realized I had a validator in "easy reach" so I used it on the
above url. I got

XML error: undefined entity, at line 22, column 23535

Using my handy dandy editor, I have cut and pasted some text from around
the offending section.

<description>I've ...

that our activities as feminists &acirc;'' including the
                                 ^^^^^^^
                                 ERROR

... of new ideas.</description>
You can see which entity is causing a problem. It fails on the first
error, so there could be other errors after that.
--

This programmer available for rent.
Oct 12 '05 #3
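
The validator used above is not named in the thread. As a rough equivalent, here is a minimal Python sketch that reports the position of the first well-formedness error using the standard library's xml.etree.ElementTree. The function name first_xml_error and the sample feed are assumptions for illustration. Like the tool above, it stops at the first error, so later errors stay hidden until the first one is fixed.

```python
import xml.etree.ElementTree as ET


def first_xml_error(xml_text):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# Hypothetical one-line feed containing an undeclared HTML entity.
broken = (
    "<rss><channel><description>feminists &acirc; including"
    "</description></channel></rss>"
)
print(first_xml_error(broken))
```

This checks well-formedness only. It does not validate against a DTD, so it will not report errors such as undeclared elements.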
> First I would feed it through an xml validator. It should tell you where
> the xml goes wrong.
> If it fails, then you know what's wrong. If it passes, well, worry about
> that after the first test.


That was a very good idea. I got a very large number of errors. You can
see them if you go here:

http://www.stg.brown.edu/service/xmlvalid/

and type in this address to the URI validation field:

http://www.whatisliberalism.com/pdsFiles/page2533.xml
I was left wondering what some of the errors meant. What does " error
(1103): end tag uses GI for an undeclared element: title " mean?

And what does " error (1012): reference to undeclared entity:
&acirc; " mean?

I'm confused by the last error. I don't know much about XML, but I
didn't think that an HTML entity reference was invalid in XML. Why
would it be? What's the easiest way to sanitize HTML entity references
so that XML won't choke on them?

Oct 12 '05 #4
lk******@geocities.com wrote:
> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?
>
> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML. Why
> would it be?

Because nobody defined them for the XML-based language that you use.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?

Define them.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 12 '05 #5
I don't know how to define entity references for XML, nor do I know
whether I'm allowed to add new definitions to RSS. XML is one of those
things I've been hoping to study for a while but have not yet had the
chance.

I'm wondering if there is a quick fix that will hold me till I have
time to look at the issue in depth. If I write a little PHP script to
strip out all HTML entity references, will the feed then work?

Oct 12 '05 #6
lk******@geocities.com wrote:
: I don't know how to define entity references for XML, nor am I aware if
: I'm allowed to add new definitions to RSS. XML is one of those things
: I've been hoping to study for awhile but have not yet had the chance.

: I'm wondering if there is a quick fix that will hold me till I have
: time to look at the issue in depth. If I write a little PHP script to
: strip out all HTML entity references, then the feed will work?

The quick fix for unrecognized entities is to escape them, so

&circ; should be escaped to become
&amp;circ;

The escaped data "&amp;circ;" will be unescaped back to the original
"&circ;" if an xml program extracts the data from the feed.

Whether the "&circ;" will _display_ correctly will depend on the program
that extracts and/or displays the data. I.e. if you use an xml program to
extract the description data into a file, and then use a browser to view
the file, then the browser will display the correct symbol. On the other
hand if the browser itself is reading the rss feed directly then it may or
may not display the desired symbol - it might display the word "&circ;"
instead.

As for the "GI" error, I am not familiar with that, and I'm sorry but I
haven't examined your file to figure it out.

--

This programmer available for rent.
Oct 13 '05 #7
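
The escaping described in the post above could be sketched in Python roughly as follows. This is an illustration under assumptions, not the poster's code: the function name escape_unknown_entities is hypothetical, and the regular expression treats only the five predefined XML entities and numeric character references as safe, escaping every other ampersand.

```python
import re

# An "&" is left alone only when it starts one of the five predefined XML
# entities or a numeric character reference; anything else gets escaped.
_UNSAFE_AMP = re.compile(
    r"&(?!(?:amp|lt|gt|apos|quot|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_unknown_entities(text: str) -> str:
    """Turn the "&" of unrecognized entity references into "&amp;"."""
    return _UNSAFE_AMP.sub("&amp;", text)


print(escape_unknown_entities("feminists &acirc; including R&D &amp; &#226;"))
```

As the post notes, whether the result displays as the intended symbol depends on the program that reads the feed.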
lk******@geocities.com wrote:
> > First I would feed it through an xml validator. It should tell you where
> > the xml goes wrong.
> > If it fails, then you know what's wrong. If it passes, well, worry about
> > that after the first test.
>
> That was a very good idea. I got a very large number of errors. You can
> see them if you go here:
>
> http://www.stg.brown.edu/service/xmlvalid/
>
> and type in this address to the URI validation field:
>
> http://www.whatisliberalism.com/pdsFiles/page2533.xml
> I was left wondering what some of the errors meant. What does " error
> (1103): end tag uses GI for an undeclared element: title " mean?

It means title was never declared in the DTD or Schema.

> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?

It means acirc was never declared in the DTD.

> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML.

It is if you haven't declared it (with the exception of the five
predefined entities, amp, lt, gt, apos, and quot, which every XML
processor recognizes whether or not they are declared).

> Why would it be?

Because that's what the rules say.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?

Convert them to actual characters (e.g. â for acirc) using the
declared character set of the document.

///Peter
--
XML FAQ: http://xml.silmaril.ie/

Oct 13 '05 #8
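
The conversion suggested above does not have to be looked up by hand. As a minimal sketch in Python (the thread's own code is PHP and is not shown), the standard library table html.entities.name2codepoint maps HTML 4 entity names to code points. The function name html_entities_to_chars is an assumption. The five predefined XML entities are deliberately left alone, since turning &amp; or &lt; into raw characters would break the XML.

```python
import re
from html.entities import name2codepoint

# These must stay escaped; converting them would produce raw "&" or "<".
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def html_entities_to_chars(text: str) -> str:
    """Replace known HTML named entities with the characters they stand for."""

    def replace(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined and unknown names as-is
        return chr(name2codepoint[name])

    return _NAMED_REF.sub(replace, text)


converted = html_entities_to_chars("feminists &acirc; &amp; &bogus;")
# Encode with whatever encoding the feed declares, assumed here to be UTF-8.
print(converted.encode("utf-8"))
```

Unknown names such as the made-up &bogus; are left in place, so they would still need escaping or removal before the feed parses.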
lk******@geocities.com wrote:
> I don't know how to define entity references for XML, nor am I aware if
> I'm allowed to add new definitions to RSS. XML is one of those things
> I've been hoping to study for awhile but have not yet had the chance.
>
> I'm wondering if there is a quick fix that will hold me till I have
> time to look at the issue in depth. If I write a little PHP script to
> strip out all HTML entity references, then the feed will work?


If you can change the feed, you could define the entities in a document
type declaration:

<!DOCTYPE rss [
<!ENTITY acirc "â">
]>
<rss>
....
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 13 '05 #9
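
To check that a declaration like the one above makes the entity usable, here is a small Python sketch using the standard library parser. The feed content is a made-up sample, and the entity value is written as the numeric reference &#226; rather than a literal â so that the example does not depend on the file's encoding.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the internal DTD subset declares acirc before it is used.
feed = """<!DOCTYPE rss [
<!ENTITY acirc "&#226;">
]>
<rss><channel><description>feminists &acirc; including</description></channel></rss>"""

root = ET.fromstring(feed)
description = root.find("channel/description").text

# The parser expands &acirc; to the declared replacement text.
print(description)
```

Whether a given feed reader honors an internal DTD subset is a separate question. Some readers ignore or reject DOCTYPE declarations, so this is worth testing against the readers that matter.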

Peter Flynn wrote:
> lk******@geocities.com wrote:
> > What's the easiest way to sanitize HTML entity references
> > so that XML won't choke on them?
>
> Convert them to actual characters (eg â for acirc) using the
> declared character set of the document.


I see. So if I say that the character encoding for the feed is UTF-8, I
look up what the equivalent of acirc is for UTF-8. That sounds like the
right long-term goal for me to aim for. Should be simple enough to look
up all the entity references on w3c and translate them all to UTF-8,
yes?

Oct 31 '05 #10
