xml_parse and entities and quotes

Hello,
I have to parse an XML file which contains entities like &#8220; and &#8221;
and apostrophes (').
I have created a parser and I parse the XML, but the xml_parse function
converts the entities and apostrophes to question marks (?).

I am sure about that because I read the XML source into a string and I can see
what happens before and after xml_parse.

Am I missing something?

The source is ISO-8859-1 and is converted with utf8_encode before xml_parse.
Feb 19 '07 #1
I'm not exactly sure what you are trying to do, but maybe
"html_entity_decode" will help you. Just look it up in the manual.
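For what it's worth, here is a minimal sketch of what html_entity_decode does with numeric entities. The sample string and the variable names are made up for illustration, and it assumes the text should end up as UTF-8:

```php
<?php
// Hypothetical sample text containing numeric entities and one of the
// five standard XML entities.
$raw = 'She said &#8220;it&#39;s fine&#8221; &amp; left';

// ENT_QUOTES also decodes single-quote entities; 'UTF-8' is the charset
// of the returned string.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Note that this also turns &amp; and &lt; into bare characters, so it is probably safer to apply it to text that has already come out of the parser than to the raw XML document.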
Feb 19 '07 #2
A.Martini wrote:
the source is iso-8859-1 and is converted with utf8_encode before
xml_parse
I'm pretty sure you're on the right track with it being a character
encoding issue.

However, if the source string only contains *entities* (e.g. &#8220;)
rather than true non-ASCII characters (e.g. “), then utf8_encode shouldn't
actually do anything with the source. (It doesn't need to, as ASCII is a
proper subset of both ISO-8859-1 and UTF-8.)
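A quick way to check this point, as a sketch only. It uses mb_convert_encoding from the mbstring extension as a stand-in for utf8_encode (which is deprecated as of PHP 8.2); the helper name latin1_to_utf8 and the sample strings are made up:

```php
<?php
// Hypothetical helper: same ISO-8859-1 -> UTF-8 conversion that
// utf8_encode() performs, done with mbstring instead.
function latin1_to_utf8(string $s): string
{
    return mb_convert_encoding($s, 'UTF-8', 'ISO-8859-1');
}

$entities = 'He said &#8220;hi&#8221;'; // pure ASCII: entities only
$raw      = "caf\xE9";                  // e-acute as a single ISO-8859-1 byte

// ASCII-only input comes back byte for byte identical.
var_dump(latin1_to_utf8($entities) === $entities);

// The raw non-ASCII byte is re-encoded as a two-byte UTF-8 sequence.
echo bin2hex(latin1_to_utf8($raw)), "\n";
```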

So it is likely that either:

1. xml_parse is actually converting &#8220; => ?
OR
2. xml_parse is correctly converting &#8220; => “
but whatever you do with the result isn't Unicode-aware.

Number 2 is likely, as (until PHP 6 comes out) most native PHP functions
don't handle Unicode very well.
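One way to tell the two cases apart is to dump the bytes the parser hands to the character data handler. The sketch below is only an illustration: the helper name parse_text and the sample document are made up, and XML_OPTION_TARGET_ENCODING is an option of PHP's XML extension that has not come up in this thread so far.

```php
<?php
// Hypothetical helper: parse a small document and return all character
// data, with the parser's target (output) encoding set explicitly.
function parse_text(string $xml, string $targetEncoding): string
{
    $text = '';
    $parser = xml_parser_create('UTF-8');
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);
    // The handler may be called several times per text node, so append.
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });
    xml_parse($parser, $xml, true);
    return $text;
}

$xml = '<?xml version="1.0" encoding="UTF-8"?><t>&#8220;hi&#8221;</t>';

// With a UTF-8 target the curly quotes survive as 3-byte sequences.
echo bin2hex(parse_text($xml, 'UTF-8')), "\n";

// With an ISO-8859-1 target they cannot be represented and come out as "?".
echo bin2hex(parse_text($xml, 'ISO-8859-1')), "\n";
```

If the first line shows e2809c at the start, the parser itself is fine and the "?" is introduced either by an ISO-8859-1 target encoding or by later, non-Unicode-aware processing.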

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 19 '07 #3
Hello,

Thank you all.

Actually I solved my problem after some Google Code Search for "&#8230;".
I have found this function, which definitely solves my problem:

http://www.martintod.org.uk/blog/wp-...ssentities.zip

The big problem was caused by those curly quotes that come from Word cut & paste
into the RSS.

$poshquotes = array('’','Â’','‘','‘','“','”','&#8212;','–','&#8230;');
$plebquotes = array("'","'","'","'",'"','"','-','-','...');

(Warning: the curly quotes above are not displayed properly; you should
download the code to see them properly.)

It seems that some entities are not recognised by the parser and should be
translated into simpler (equivalent) characters.
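A sketch of that kind of translation, written with explicit code points so the curly quotes cannot get mangled in transit. This is not the downloaded function; the name plain_quotes and the exact mapping are assumptions, and the "\u{...}" syntax needs PHP 7 or later:

```php
<?php
// Hypothetical helper: replace typographic characters, and the numeric
// entities for them, with plain ASCII equivalents.
function plain_quotes(string $text): string
{
    $map = [
        "\u{2018}" => "'",   '&#8216;' => "'",   // left single quote
        "\u{2019}" => "'",   '&#8217;' => "'",   // right single quote
        "\u{201C}" => '"',   '&#8220;' => '"',   // left double quote
        "\u{201D}" => '"',   '&#8221;' => '"',   // right double quote
        "\u{2013}" => '-',   '&#8211;' => '-',   // en dash
        "\u{2014}" => '-',   '&#8212;' => '-',   // em dash
        "\u{2026}" => '...', '&#8230;' => '...', // ellipsis
    ];
    // strtr() with an array replaces every key in a single pass.
    return strtr($text, $map);
}

echo plain_quotes("\u{201C}Hi\u{201D} &#8212; it&#8217;s here&#8230;"), "\n";
```

This assumes the input is UTF-8. The trade-off is that the typographic characters are lost for good, which may or may not matter for a feed.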

Feb 19 '07 #4
A.Martini wrote:

Hello,

Analyzing the problem in more depth, I discovered that I have to replace
all the &#nnn; and &#nnnn; with <![CDATA[&#nnn;]]> and
<![CDATA[&#nnnn;]]> so that the entities pass through the parser intact and are
not replaced by ?.

Now the challenge is to find a regular expression to do the substitution of
all:

&#nnn; -> <![CDATA[&#nnn;]]>

and

&#nnnn; -> <![CDATA[&#nnnn;]]>
I have written a new function:

function cdataize($content)
{
    $content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
    return($content);
}

This will solve the problem in a more elegant way :-)
Feb 19 '07 #5
A.Martini wrote:
I have written a new function:

function cdataize($content)
{
    $content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
    return($content);
}
Entities can use hexadecimal notation too...

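function cdataize($content)
{
    // decimal references, e.g. &#8220;
    $content = preg_replace('/\&\#([0-9]+);/','<![CDATA[&#\\1;]]>',$content);
    // hexadecimal references, e.g. &#x201C; (keep the "x" in the output)
    $content = preg_replace('/\&\#x([0-9A-F]+);/i','<![CDATA[&#x\\1;]]>',$content);
    return($content);
}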

There are also named entities, like "&mdash;", but I don't think RSS
defines any such entities apart from the five standard XML ones.
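To show what the CDATA wrapping does to a feed fragment, here is a self-contained sketch that handles decimal and hexadecimal references with one pattern. The name cdataize_refs and the sample string are made up; it is a variation on the function above, not a replacement for it:

```php
<?php
// Hypothetical variant of cdataize(): wrap every numeric character
// reference (decimal or hex) in a CDATA section, leaving named entities
// such as &amp; untouched.
function cdataize_refs(string $content): string
{
    return preg_replace(
        '/&#(x[0-9A-Fa-f]+|[0-9]+);/',
        '<![CDATA[&#$1;]]>',
        $content
    );
}

$sample = '<title>&#8220;Hi&#x201D; &amp; bye</title>';
echo cdataize_refs($sample), "\n";
```

The parser should then hand the literal text "&#8220;" to the character data handler instead of the decoded character. Two caveats: CDATA sections are not allowed inside attribute values, and references that already sit inside a CDATA section would get wrapped a second time, so this only suits simple element content.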

--
Toby A Inkster BSc (Hons) ARCS
Contact Me ~ http://tobyinkster.co.uk/contact
Geek of ~ HTML/SQL/Perl/PHP/Python*/Apache/Linux

* = I'm getting there!
Feb 20 '07 #6
