xml parsing escape characters

Hi,

I only know a little bit of XML, and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a URL from Python 2.3.3 running on Linux with the
following lines:
resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
________________________________________________________________

In the lines below, I try to get all the child nodes of string, first
by counting them and then ignoring the \n ones:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos = {}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no]) > 4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
________________________________________________________________

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
~ <DataSet>
~ <Order>
~ <Customer>439</Customer>
(... others ...)
~ </Order>
~ </DataSet>
</string>

and the lines I posted work as intended.

I already browsed the web, I know it's about the escape characters, but
I didn't find a simple solution for this.

I tried to use LL2XML.py and an unescape function with a simple replace
text = text.replace("&lt;", "<")
but I had to convert the XML document to a string, and then I could not
(or don't know how to) convert it back to an XML object.

How can I solve this? Please explain it bearing in mind that I'm just
beginning with XML and I'm not very experienced in Python either.
Luis
Jul 18 '05 #1
Luis P. Mendes wrote:
I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
Most likely, this result is correct, and your document
really does contain

&lt;Order&gt;

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
~ <DataSet>
Most likely, your browser is incorrect (or at least confusing), and
renders &lt; as "<", even though this is not markup.
I already browsed the web, I know it's about the escape characters, but
I didn't find a simple solution for this.


Not sure what "this" is. AFAICT, everything works correctly.

Regards,
Martin
Jul 18 '05 #2
this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

When I do:

print xmldoc.toxml()

it prints:
<?xml version="1.0" ?>
<string xmlns="http://www...">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>

________________________________________________________________
with: stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
I get:
<string xmlns="http://www.......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>
________________________________________________________________

with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;
________________________________________________________________

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?
Why doesn't it return:
&lt;Order&gt;
&lt;Customer&gt;439&lt;/Customer&gt;

&lt;/Order&gt;
??
Jul 18 '05 #3
Luis P. Mendes wrote:
this is the xml document:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
~ &lt;/Order&gt;
&lt;/DataSet&gt;</string>


This is an XML document containing a single tag, <string>, whose content is text containing
entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>, <Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to unescape the contents of the
<string> tag to be able to treat it as structured XML.

Kent
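
To illustrate the unescaping step described above, here is a minimal sketch. It assumes the response looks like the document quoted in this thread; the namespace URL, the sample values, and the helper name inner_document are placeholders made up for the example, not anything from the original service. The thread uses Python 2.3; this sketch is written for a current Python. The outer parser already turns the entities back into < and > inside the Text node, so that text can simply be handed to a second parser:

```python
from xml.dom import minidom

# Stand-in for the response body; the namespace URL is a placeholder.
OUTER = (
    '<?xml version="1.0" encoding="utf-8"?>'
    '<string xmlns="http://example.invalid/">'
    '&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    '&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

def inner_document(outer_xml):
    """Parse the wrapper, then parse the text it carries as a second document."""
    outer = minidom.parseString(outer_xml)
    string_node = outer.documentElement
    # Join every Text child; the parser has already unescaped the entities.
    text = "".join(
        [child.data for child in string_node.childNodes
         if child.nodeType == child.TEXT_NODE]
    )
    return minidom.parseString(text)

inner = inner_document(OUTER)
print(inner.documentElement.tagName)
```

After the second parse, DataSet, Order and Customer are real Element nodes, so childNodes and getElementsByTagName behave the way the original code expected.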
Jul 18 '05 #4
Kent Johnson wrote:
[...]
This is an XML document containing a single tag, <string>, whose content
is text containing entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>,
<Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to
unescape the contents of the <string> tag to be able to treat it as
structured XML.


The unescaping is usually done for you by the xml parser that you use.

--Irmen
Jul 18 '05 #5
Irmen de Jong wrote:
Kent Johnson wrote:
[...]
This is an XML document containing a single tag, <string>, whose
content is text containing entity-escaped XML.

This is *not* an XML document containing tags <DataSet>, <Order>,
<Customer>, etc.

All the behaviour you are seeing is a consequence of this. You need to
unescape the contents of the <string> tag to be able to treat it as
structured XML.

The unescaping is usually done for you by the xml parser that you use.


Yes, so if your XML contains for example
<stuff>&lt;not a tag&gt;</stuff>

and you parse this and ask for the *text* content of the <stuff> tag, you will get the string
"<not a tag>"

but it's still *not* a tag. If you try to get child elements of the <stuff> element there will be none.
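
The point can be checked directly with minidom. This is only a small sketch using the same made-up stuff element; the variable names are arbitrary:

```python
from xml.dom import minidom

doc = minidom.parseString("<stuff>&lt;not a tag&gt;</stuff>")
stuff = doc.documentElement

# The parser unescapes the entities, so the text content reads like markup...
text = stuff.firstChild.data
print(text)

# ...but the only child of <stuff> is a Text node; there are no child elements.
child_elements = [n for n in stuff.childNodes if n.nodeType == n.ELEMENT_NODE]
print(len(child_elements))
```

The first print shows the unescaped string, while the element count stays at zero.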

This is exactly the confusion the OP has.

--Irmen

Jul 18 '05 #6
Luis P. Mendes wrote:
with: DataSetNode = stringNode.childNodes[0]
print DataSetNode.toxml()

I get:

&lt;DataSet&gt;
~ &lt;Order&gt;
~ &lt;Customer&gt;439&lt;/Customer&gt;

~ &lt;/Order&gt;
&lt;/DataSet&gt;
________________________________________________________________

so far so good, but when I issue the command:

print DataSetNode.childNodes[0]

I get:
IndexError: tuple index out of range

Why the error, and why does it return a tuple?


The DataSetNode has no children, because it is not
an Element node, but a Text node. In XML, an element
is denoted by

<DataSet>...</DataSet>

and *not* by

&lt;DataSet&gt; ...&lt;/DataSet&gt;

The latter is just a single string, represented
in XML as a Text node. It does not give you any
hierarchy whatsoever.

As a Text node does not have any children, its
childNodes member is an empty tuple; accessing
that tuple gives you an IndexError.
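
A quick way to see this is to check the node type before descending into a node. The following is a sketch with a shortened stand-in document (the real payload is elided in this thread), written for a current Python rather than 2.3:

```python
from xml.dom import minidom

doc = minidom.parseString("<string>&lt;DataSet&gt;...&lt;/DataSet&gt;</string>")
node = doc.documentElement.childNodes[0]

# A Text node, not an Element: it carries a string and has no children.
is_text = node.nodeType == node.TEXT_NODE
print(is_text)
print(len(node.childNodes))

# Indexing the empty child list is what raises the IndexError.
try:
    node.childNodes[0]
    raised = False
except IndexError:
    raised = True
print(raised)
```

Testing nodeType against TEXT_NODE or ELEMENT_NODE first avoids indexing into a node that cannot have children.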

Regards,
Martin
Jul 18 '05 #7
Irmen de Jong wrote:
The unescaping is usually done for you by the xml parser that you use.


Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.

Regards,
Martin
Jul 18 '05 #8
Martin v. Löwis wrote:
Irmen de Jong wrote:
The unescaping is usually done for you by the xml parser that you use.

Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.


That's also what I said?

The unescaping of the XML entities in the contents of the OP's
<string> element is done for you by the parser,
so you will get a text node with the <, >, &, whatever in there.
The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.
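
Once the text has been fed to a second parser, getting the "dictionaries inside a list" the OP asked for is a short loop. This is a sketch only: orders_as_dicts is a hypothetical helper name, the sample input contains just the Customer field shown in the thread (the other fields are elided there), and it assumes each field holds plain text:

```python
from xml.dom import minidom

def orders_as_dicts(inner_xml):
    """Return one dict per <Order>, mapping child tag names to their text."""
    doc = minidom.parseString(inner_xml)
    orders = []
    for order in doc.getElementsByTagName("Order"):
        record = {}
        for field in order.childNodes:
            if field.nodeType != field.ELEMENT_NODE:
                continue  # skip the whitespace-only Text nodes between tags
            record[field.tagName] = "".join(
                [t.data for t in field.childNodes if t.nodeType == t.TEXT_NODE]
            )
        orders.append(record)
    return orders

# The text already extracted from <string>; the value is sample data.
INNER = """<DataSet>
  <Order>
    <Customer>439</Customer>
  </Order>
</DataSet>"""

print(orders_as_dicts(INNER))
```

Filtering on ELEMENT_NODE replaces the length check on toxml() from the original post, which was only there to skip the newline Text nodes.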

--Irmen
Jul 18 '05 #9
Irmen de Jong wrote:
Usually, but not in this case. If you have a text that looks like
XML, and you want to put it into an XML element, the XML file uses
&lt; and &gt;. The XML parser unescapes that as < and >. However, it
does not then consider the < and > as markup, and it shouldn't.

That's also what I said?


You said it in response to
All the behaviour you are seeing is a consequence of this. You need
to unescape the contents of the <string> tag to be able to treat it
as structured XML.

In that context, I interpreted
The unescaping is usually done for you by the xml parser that you
use.


as "The parser should have done what you want; if the parser didn't,
that is a bug in the parser".
The OP probably wants to feed that to a new xml parser instance
to process it as markup.
Or perhaps the way the original XML document is constructed is
flawed.


Either of these, indeed - probably the latter.

Regards,
Martin
Jul 18 '05 #10
