xml_parse and entities and quotes

Hello,
I have to parse an XML file which has entities like &#8220; and &#8221;
and apostrophes (').
I have created a parser and I parse the XML, but the xml_parse function
converts entities and apostrophes to question marks (?).

I am sure about that because I read the xml source in a string and I can see
what happens before and after xml_parse.

Am I missing something?

The source is ISO-8859-1 and is converted with utf8_encode before xml_parse.
Feb 19 '07 #1
5 Replies


I'm not exactly sure what you are trying to do, but maybe
"html_entity_decode" will help you; just look in the manual.
A.Martini wrote:
Hello,
I have to parse an xml file which has entities like &#8220; and &#8221;
and apostrophes (').
I have created a parser and I parse the xml but the xml_parse function
convert entities and apostrophes to question marks (?)

I am sure about that because I read the xml source in a string and I can see
what happens before and after xml_parse.

I am missing something?

the source is iso-8859-1 and is converted with utf8_encode before xml_parse
Feb 19 '07 #2

A.Martini wrote:
the source is iso-8859-1 and is converted with utf8_encode before
xml_parse
I'm pretty sure you're on the right track with it being a character
encoding issue.

However, if the source string only contains *entities* (e.g. &#8220;)
rather than true non-ASCII characters (e.g. “), then utf8_encode shouldn't
actually do anything with the source. (It doesn't need to, as ASCII is a
proper subset of both ISO-8859-1 and UTF-8.)

So it is likely that either:

1. xml_parse is actually converting &#8220; => ?
OR
2. xml_parse is correctly converting &#8220; => “
but whatever you do with the result isn't Unicode-aware.

Number 2 is likely, as (until PHP 6 comes out) most native PHP functions
don't handle Unicode very well.
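
One way to tell the two cases apart is to look at the parser's target encoding. The PHP manual notes that characters which cannot be represented in the target encoding are replaced by a question mark, so a parser whose target is ISO-8859-1 would turn &#8220; into ?. The thread never confirms that this is what happened here, so treat the following as a minimal sketch under that assumption; collect_text is a hypothetical helper name, and the closure and type declarations need a newer PHP than the one current in 2007.

```php
<?php
// Hypothetical helper: parse $xml and return all character data,
// delivered in the requested target encoding.
function collect_text(string $xml, string $targetEncoding): string
{
    $parser = xml_parser_create('UTF-8'); // encoding of the input string
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);

    $text = '';
    // The handler may be called several times for one element, so append.
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;
        }
    );
    xml_parse($parser, $xml, true);
    return $text;
}

$xml = '<title>&#8220;Hi&#8221;</title>';

// U+201C and U+201D do not exist in ISO-8859-1, so they are demoted.
echo collect_text($xml, 'ISO-8859-1'), "\n"; // ?Hi?

// With a UTF-8 target the curly quotes arrive as UTF-8 bytes.
echo bin2hex(collect_text($xml, 'UTF-8')), "\n"; // e2809c4869e2809d
```

If the second call gives correct bytes but the page still shows question marks, the problem is more likely in how the result is output (case 2 above).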

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 19 '07 #3

Hello,

Thank you all.

Actually I solved my problem after some Google code search for "&#8230;".
I found this function, which definitely solves my problem.

http://www.martintod.org.uk/blog/wp-...ssentities.zip

The big problem was caused by curly quotes that came from Word cut & paste
into the RSS.

$poshquotes = array('’','’','‘','‘','“','”','&#8212;','–','&#8230;');
$plebquotes = array("'","'","'","'",'"','"','-','-','...');

(Warning: the curly quotes above are not displayed properly; you should
download the code to see them correctly.)

It seems that some entities are not recognised by the parser and should be
translated into simpler (equivalent) characters.
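
Since the arrays above did not survive the forum's rendering, here is a rough sketch of the same idea with every character written as an escape so nothing can be mangled in transit. It is not the code from the linked zip file: plain_punctuation is a hypothetical name, the list of characters is my own guess at a useful set, and the literal characters are assumed to be UTF-8 (the "\u{...}" syntax needs PHP 7 or later).

```php
<?php
// Hypothetical helper: replace "posh" punctuation, written either as a
// decimal character reference or as a literal UTF-8 character, with
// plain ASCII equivalents.
function plain_punctuation(string $text): string
{
    $map = [
        '&#8216;' => "'",   "\u{2018}" => "'",   // left single quote
        '&#8217;' => "'",   "\u{2019}" => "'",   // right single quote
        '&#8220;' => '"',   "\u{201C}" => '"',   // left double quote
        '&#8221;' => '"',   "\u{201D}" => '"',   // right double quote
        '&#8211;' => '-',   "\u{2013}" => '-',   // en dash
        '&#8212;' => '-',   "\u{2014}" => '-',   // em dash
        '&#8230;' => '...', "\u{2026}" => '...', // ellipsis
    ];
    // strtr() with an array never re-scans text it has already replaced.
    return strtr($text, $map);
}

echo plain_punctuation('&#8220;Hi&#8221; &#8212; it&#8217;s done&#8230;'), "\n";
// prints: "Hi" - it's done...
```

Note that this loses information (an em dash and a hyphen become the same character), which is the trade-off of this approach.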

Feb 19 '07 #4

A.Martini wrote:

Hello,

Analyzing the problem in more depth, I discovered that I have to substitute
all the &#nnn; and &#nnnn; with <![CDATA[&#nnn;]]> and
<![CDATA[&#nnnn;]]> so that the references pass through the parser unchanged
and are not substituted by ?.

Now the challenge is to find a regular expression to do the substitution of
all:

&#nnn; -> <![CDATA[&#nnn;]]>

and

&#nnnn; -> <![CDATA[&#nnnn;]]>
I have written a new function

function cdataize($content)
{
$content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
return($content);
}

This will solve the problem in a more elegant way :-)
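
To make the effect concrete, the sketch below runs one wrapped string through a parser whose target encoding is ISO-8859-1 (an assumption; the thread does not say how the parser was created). Inside a CDATA section the parser does not expand character references, so the text arrives as the literal characters &#8220; and can be sent to a browser as is. The names wrap_decimal_refs and character_data are hypothetical.

```php
<?php
// Same substitution as cdataize() above, under a different name so the
// two can coexist in one file.
function wrap_decimal_refs(string $xml): string
{
    return preg_replace('/&#([0-9]+);/', '<![CDATA[&#$1;]]>', $xml);
}

// Hypothetical helper: return the character data the parser delivers.
function character_data(string $xml): string
{
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, 'ISO-8859-1');

    $text = '';
    xml_set_character_data_handler(
        $parser,
        function ($parser, string $data) use (&$text): void {
            $text .= $data;
        }
    );
    xml_parse($parser, $xml, true);
    return $text;
}

$wrapped = wrap_decimal_refs('<title>&#8220;Hi&#8221;</title>');
echo $wrapped, "\n";
// prints: <title><![CDATA[&#8220;]]>Hi<![CDATA[&#8221;]]></title>

echo character_data($wrapped), "\n";
// prints: &#8220;Hi&#8221;
```

One caveat: the pattern also matches references inside attribute values, where a CDATA section is not allowed, so this is only safe when the references occur in element content.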
Feb 19 '07 #5

A.Martini wrote:
I have written a new function

function cdataize($content)
{
$content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
return($content);
}
Entities can use hexadecimal notation too...

function cdataize($content)
{
$content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
$content = preg_replace('/\&\#x([0-9A-F]+);/i','<![CDATA[&#x\\1;]]>',$content);
return($content);
}

There are also named entities, like "&mdash;", but I don't think RSS
defines any such entities apart from the five standard XML ones.
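
If you prefer a single pass, the decimal and hexadecimal cases can be folded into one pattern. This is only a sketch of that variant, with wrap_char_refs as a hypothetical name; it keeps the "x" in the output and leaves named entities such as &amp; untouched.

```php
<?php
// Hypothetical one-pass variant: wrap decimal (&#8220;) and hexadecimal
// (&#x201C;) character references in CDATA sections.
function wrap_char_refs(string $xml): string
{
    return preg_replace(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/', // "x" + hex digits, or decimal digits
        '<![CDATA[&#$1;]]>',           // $1 carries the "x" when present
        $xml
    );
}

echo wrap_char_refs('<title>&#8220;Hi&#x201D; &amp; bye</title>'), "\n";
// prints: <title><![CDATA[&#8220;]]>Hi<![CDATA[&#x201D;]]> &amp; bye</title>
```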

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 20 '07 #6
