Bytes | Software Development & Data Engineering Community

an idiot question about a disallowed entity


Can't get this RSS feed clean:

http://www.whatisliberalism.com/pdsFiles/page2533.xml
Why is it dying?

Some users write posts in Microsoft Word, then copy and paste them into
the web browser and hit submit to create a weblog entry. That is what I
just did myself.

I've written a PHP function that I thought would clean this feed. It
goes through the whole feed one byte at a time and makes sure every
byte has an ASCII value between 32 and 126. I thought that might leave
me with some garbage characters, but they'd all be safe for RSS.

No. The feed is still dying. How do I find out what entity is killing
it?

Oct 12 '05 #1
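For reference, the byte filter described above can be sketched roughly as follows. This is a Python stand-in for the PHP function, which the post does not show; the name `keep_printable_ascii` and the sample string are assumptions made for illustration. It suggests why such a filter cannot catch an entity reference: every character of `&acirc;` is already printable ASCII.

```python
def keep_printable_ascii(data: bytes) -> bytes:
    """Drop every byte outside the printable ASCII range 32..126."""
    return bytes(b for b in data if 32 <= b <= 126)


# Hypothetical sample: an entity reference followed by two non-ASCII bytes.
sample = b"feminists &acirc;\x80\x94 including"
cleaned = keep_printable_ascii(sample)
# The high bytes are gone, but the entity reference passes straight through.
```

Note that a filter like this also drops tabs and newlines (bytes 9, 10 and 13), which are legal in XML.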
lk******@geocities.com wrote:

: Can't get this RSS feed clean:

: http://www.whatisliberalism.com/pdsFiles/page2533.xml
: Why is it dying?

: Some users write posts in Microsoft Word, then copy and paste their
: post to the web browser and paste it in and hit submit and create a
: weblog entry. This is what I just did myself.

: I've written a PHP function that I thought would clean this feed, it
: goes through the whole feed one byte at a time, and makes sure every
: byte has an ascii value between 32 and 126. I thought that might give
: me some garbage characters but they'd all be safe for RSS.

: No. The feed is still dying. How do I find out what entity is killing
: it?

First I would feed it through an xml validator. It should tell you where
the xml goes wrong.

If it fails then you know what's wrong. If it passes, well, worry about
that after the first test.

--

This programmer available for rent.
Oct 12 '05 #2
Malcolm Dew-Jones (yf***@vtn1.victoria.tc.ca) wrote:
: lk******@geocities.com wrote:

: : Can't get this RSS feed clean:

: : http://www.whatisliberalism.com/pdsFiles/page2533.xml
: : Why is it dying?

: : Some users write posts in Microsoft Word, then copy and paste their
: : post to the web browser and paste it in and hit submit and create a
: : weblog entry. This is what I just did myself.

: : I've written a PHP function that I thought would clean this feed, it
: : goes through the whole feed one byte at a time, and makes sure every
: : byte has an ascii value between 32 and 126. I thought that might give
: : me some garbage characters but they'd all be safe for RSS.

: : No. The feed is still dying. How do I find out what entity is killing
: : it?

: First I would feed it through an xml validator. It should tell you where
: the xml goes wrong.

: If it fails then you know what's wrong. If it passes, well, worry about
: that after the first test.

In fact I realized I had a validator in "easy reach" so I used it on the
above url. I got

XML error: undefined entity, at line 22, column 23535

Using my handy dandy editor, I have cut and pasted some text from around
the offending section.

<description>I've ...

that our activities as feminists &acirc;'' including the
                                 ^^^^^^^
                                  ERROR

... of new ideas.</description>
You can see which entity is causing a problem. It fails on the first
error, so there could be other errors after that.
--

This programmer available for rent.
Oct 12 '05 #3
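The same kind of check can be run locally. The sketch below is an assumption about tooling, not the validator used in the post: it uses Python's standard `xml.etree.ElementTree`, which only checks well-formedness (not validity against a DTD), and a hypothetical helper named `first_xml_error`. Like the validator above, it stops at the first error.

```python
import xml.etree.ElementTree as ET


def first_xml_error(text: str):
    """Return (message, line, column) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return str(err), line, column
    return None


# Hypothetical three-line stand-in for the real feed.
feed = (
    "<rss><channel><item>\n"
    "<description>feminists &acirc; including</description>\n"
    "</item></channel></rss>"
)
problem = first_xml_error(feed)
# problem[0] is the parser's message, problem[1] the line it points at.
```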
> First I would feed it through an xml validator. It should tell you where
> the xml goes wrong.
> If it fails then you know what's wrong. If it passes, well, worry about
> that after the first test.


That was a very good idea. I got a very large number of errors. You can
see them if you go here:

http://www.stg.brown.edu/service/xmlvalid/

and type in this address to the URI validation field:

http://www.whatisliberalism.com/pdsFiles/page2533.xml
I was left wondering what some of the errors meant. What does " error
(1103): end tag uses GI for an undeclared element: title " mean?

And what does " error (1012): reference to undeclared entity:
&acirc; " mean?

I'm confused by the last error. I don't know much about XML, but I
didn't think that an HTML entity reference was invalid in XML. Why
would it be? What's the easiest way to sanitize HTML entity references
so that XML won't choke on them?

Oct 12 '05 #4
lk******@geocities.com wrote:
> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?
>
> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML. Why
> would it be?

Because nobody defined them for the XML-based language that you use.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?

Define them.
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 12 '05 #5
I don't know how to define entity references for XML, nor do I know
whether I'm allowed to add new definitions to RSS. XML is one of those
things I've been hoping to study for a while but have not yet had the
chance.

I'm wondering if there is a quick fix that will hold me till I have
time to look at the issue in depth. If I write a little PHP script to
strip out all HTML entity references, then the feed will work?

Oct 12 '05 #6
lk******@geocities.com wrote:
: I don't know how to define entity references for XML, nor do I know
: whether I'm allowed to add new definitions to RSS. XML is one of those
: things I've been hoping to study for a while but have not yet had the
: chance.

: I'm wondering if there is a quick fix that will hold me till I have
: time to look at the issue in depth. If I write a little PHP script to
: strip out all HTML entity references, then the feed will work?

The quick fix for unrecognized entities is to escape them, so

&circ; should be escaped to become
&amp;circ;

The escaped data "&amp;circ;" will be unescaped back to the original
"&circ;" when an XML program extracts the data from the feed.

Whether the "&circ;" will _display_ correctly will depend on the program
that extracts and/or displays the data. I.e. if you use an xml program to
extract the description data into a file, and then use a browser to view
the file, then the browser will display the correct symbol. On the other
hand if the browser itself is reading the rss feed directly then it may or
may not display the desired symbol - it might display the word "&circ;"
instead.

As for the "GI" error, I am not familiar with that, and I'm sorry but I
haven't examined your file to figure it out.

--

This programmer available for rent.
Oct 13 '05 #7
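The escaping quick fix can be sketched in a few lines. This is a minimal illustration in Python rather than the PHP the feed is built with; the helper name `escape_unknown_entities` and the sample string are assumptions. It leaves the five predefined XML entities and numeric references alone and escapes the ampersand of every other named reference.

```python
import re
import xml.etree.ElementTree as ET

# The five entity names every XML parser understands without a declaration.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# A named reference such as &circ; (numeric ones like &#226; do not match).
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def escape_unknown_entities(text: str) -> str:
    """Turn &name; into &amp;name; unless name is one of the predefined five."""
    def fix(match):
        name = match.group(1)
        return match.group(0) if name in PREDEFINED else "&amp;" + name + ";"
    return NAMED_REF.sub(fix, text)


raw = "<description>Q &amp; A &circ; done</description>"
safe = escape_unknown_entities(raw)
# An XML parser now accepts the text and hands back the literal "&circ;".
text_seen_by_reader = ET.fromstring(safe).text
```

Running the function twice gives the same result as running it once, because `&amp;` is itself one of the predefined names and is left untouched.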
lk******@geocities.com wrote:
>> First I would feed it through an xml validator. It should tell you where
>> the xml goes wrong.
>> If it fails then you know what's wrong. If it passes, well, worry about
>> that after the first test.
>
> That was a very good idea. I got a very large number of errors. You can
> see them if you go here:
>
> http://www.stg.brown.edu/service/xmlvalid/
>
> and type in this address to the URI validation field:
>
> http://www.whatisliberalism.com/pdsFiles/page2533.xml
>
> I was left wondering what some of the errors meant. What does " error
> (1103): end tag uses GI for an undeclared element: title " mean?

It means title was never declared in the DTD or Schema.

> And what does " error (1012): reference to undeclared entity:
> &acirc; " mean?

It means acirc was never declared in the DTD.

> I'm confused by the last error. I don't know much about XML, but I
> didn't think that an HTML entity reference was invalid in XML.

It is if you haven't declared it (with the exception of the five
that are predefined in XML itself: lt, gt, amp, apos and quot).

> Why would it be?

Because that's what the rules say.

> What's the easiest way to sanitize HTML entity references
> so that XML won't choke on them?


Convert them to actual characters (eg â for acirc) using the
declared character set of the document.

///Peter
--
XML FAQ: http://xml.silmaril.ie/

Oct 13 '05 #8
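Converting the references to real characters might look something like the sketch below. It is written in Python as an illustration only; the helper name `entities_to_characters` is an assumption, and it relies on the standard library's `html.entities.name2codepoint` table for the HTML entity names. The five predefined XML entities are deliberately left escaped so the markup stays well-formed.

```python
import re
from html.entities import name2codepoint

# Entities that must stay escaped for the XML to remain well-formed.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}
NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def entities_to_characters(text: str) -> str:
    """Replace known HTML entity references with the characters they name."""
    def convert(match):
        name = match.group(1)
        if name in PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave predefined and unknown names as-is
        return chr(name2codepoint[name])
    return NAMED_REF.sub(convert, text)


converted = entities_to_characters("caf&eacute; &amp; th&acirc;")
# Encode with the character set the feed declares, assumed here to be UTF-8.
feed_bytes = converted.encode("utf-8")
```

Names that are not in the table are left untouched by this sketch, so they would still need to be escaped or removed before the feed parses.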
lk******@geocities.com wrote:
> I don't know how to define entity references for XML, nor am I aware if
> I'm allowed to add new definitions to RSS. XML is one of those things
> I've been hoping to study for a while but have not yet had the chance.
>
> I'm wondering if there is a quick fix that will hold me till I have
> time to look at the issue in depth. If I write a little PHP script to
> strip out all HTML entity references, then the feed will work?


If you can change the feed, you could define the entities in a document
type declaration:

<!DOCTYPE rss [
<!ENTITY acirc "â">
]>
<rss>
....
--
Johannes Koch
In te domine speravi; non confundar in aeternum.
(Te Deum, 4th cent.)
Oct 13 '05 #9
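A quick way to see the declaration approach working is to parse a small document that carries the entity in its internal DTD subset. The sketch below uses Python's standard `xml.etree.ElementTree`; the feed content is a made-up stand-in, and writing the replacement text as the numeric reference `&#226;` instead of a literal character is an assumption made here so the example does not depend on the file's encoding. Whether every feed reader honors such a declaration is not something the thread establishes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed with acirc declared in the internal subset.
feed = (
    "<!DOCTYPE rss [\n"
    '  <!ENTITY acirc "&#226;">\n'
    "]>\n"
    "<rss><channel><description>&acirc;</description></channel></rss>"
)

root = ET.fromstring(feed)
# The parser expands the declared entity while reading the element text.
description = root.find("channel/description").text
```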

Peter Flynn wrote:
> lk******@geocities.com wrote:
>> What's the easiest way to sanitize HTML entity references
>> so that XML won't choke on them?
>
> Convert them to actual characters (eg â for acirc) using the
> declared character set of the document.


I see. So if I declare the character encoding of the feed as UTF-8, I
look up the UTF-8 equivalent of acirc. That sounds like the right
long-term goal for me to aim for. It should be simple enough to look
up all the entity references on the W3C site and translate them all to
UTF-8, yes?

Oct 31 '05 #10
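The lookup described above does not have to be done by hand. As a rough sketch, Python's standard `html.entities.name2codepoint` table already maps HTML entity names to Unicode code points; the helper name `describe_entity` is an assumption. It shows the character, its code point, its UTF-8 bytes in hex, and the equivalent numeric character reference, which is valid in XML whatever encoding the feed declares.

```python
from html.entities import name2codepoint


def describe_entity(name: str):
    """Return (character, code point, UTF-8 bytes as hex, numeric reference)."""
    code_point = name2codepoint[name]
    char = chr(code_point)
    return (
        char,
        "U+%04X" % code_point,
        char.encode("utf-8").hex(),
        "&#%d;" % code_point,
    )


acirc = describe_entity("acirc")
```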
