xml parsing escape characters

Hi,

I only know a little bit of XML and I'm trying to parse an XML document
in order to save its elements in a file (dictionaries inside a list).

When I access a URL from Python 2.3.3 running on Linux with the
following lines:
resposta = urllib.urlopen(url)
xmldoc = minidom.parse(resposta)
resposta.close()

I get the following result:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://www......">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
(... others ...)
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>
_________________________________________________________________

In the lines below, I try to get all the child nodes of string, first
by counting them, and then ignoring the "\n" ones:

stringNode = xmldoc.childNodes[0]
print stringNode.toxml()
dataSetNode = stringNode.childNodes[0]
numNos = len(dataSetNode.childNodes)
todosNos = {}
for no in range(numNos):
    todosNos[no] = dataSetNode.childNodes[no].toxml()
posicaoXml = [no for no in todosNos.keys() if len(todosNos[no]) > 4]
print posicaoXml

(I'm almost sure there's a simpler way to do this...)
_________________________________________________________________

I don't get any elements. But, if I access the same url via a browser,
the result in the browser window is something like:

<string xmlns="http://www......">
  <DataSet>
    <Order>
      <Customer>439</Customer>
(... others ...)
    </Order>
  </DataSet>
</string>

and the lines I posted work as intended.

I've already searched the web and I know it's about the escape
characters, but I didn't find a simple solution for this.

I tried to use LL2XML.py and an unescape function with a simple replace
text = text.replace("&lt;", "<")
but I had to convert the XML document to a string, and then I could not
work out how to convert it back to an XML object.

How can I solve this? Please explain it bearing in mind that I'm just
beginning with XML and I'm not very experienced in Python either.
Luis
Jul 18 '05
I would like to thank everyone for your answers, but I'm not seeing the
light yet!

When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http.... ............">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Should I take the contents of the string tag, which is text, replace
all '&lt;' with '<' and '&gt;' with '>', and then read it with
xml.dom.minidom? How do I do that?

Or should I use another parser that accomplishes the task with no need
to replace the escaped characters?
Jul 18 '05 #11
Luis P. Mendes wrote:
When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http.... ............">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>

Please do try to understand what you are seeing. This is crucial for
understanding what happens.

You may have the understanding that XML can be represented as a tree.
This would be good - if not, please read a book that explains why
XML can be considered as a tree.

In the tree, you have inner nodes, and leaf nodes. For example,
the document

<a>
<b>Hello</b>
<c>World</c>
</a>

has 5 nodes (ignoring whitespace content):

Element:a ---- Element:b ---- Text:"Hello"
|
\-- Element:c ---- Text:"World"

So the leaf nodes are typically Text nodes (unless you
have an empty element). Your document has this structure:

Element:string ---- Text:"""<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>"""

So the ***TEXT*** contains the letter "<", just like it contains
the letters "O" and "r". There IS no element Order in your document,
no matter how hard you look.
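
One way to check this claim is to ask the DOM directly. The following is a minimal sketch using xml.dom.minidom on a current Python 3 (not the 2.3.3 from the original post); the namespace URI and the sample bytes are invented stand-ins for the real response, which is not shown in full in this thread.

```python
from xml.dom import Node, minidom

# Hypothetical stand-in for the server response; the namespace URI is invented.
outer_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<string xmlns="http://example.invalid/ns">'
    b'&lt;DataSet&gt;&lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;'
    b'&lt;/Order&gt;&lt;/DataSet&gt;</string>'
)

xmldoc = minidom.parseString(outer_bytes)
string_node = xmldoc.documentElement

# Every child of <string> is a Text node; there are no child elements.
only_text = all(
    child.nodeType == Node.TEXT_NODE for child in string_node.childNodes
)

# Searching the whole document for an Order element finds nothing.
order_elements = xmldoc.getElementsByTagName("Order")

print(only_text)            # True
print(len(order_elements))  # 0
```

The parser has already turned each &lt; into a plain "<" character inside the text, which is why the text looks like markup when printed even though no Order element exists in the tree.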

If you want a DataSet *element* in your document, it should
read

<string xmlns="...">
<DataSet>
<Order>
<Customer>439</Customer>
</Order>
</DataSet>
</string>

As this is the document you apparently want to process, complain
to whoever gave you that other document.

should I take the contents of the string tag that is text and replace
all '&lt' with '<' and '&gt' with '>' and then read it with xml.minidom?

No. We still don't know what you want to achieve, so it is difficult to
advise you what to do. My best advice is that whoever generates the XML
document should fix it.

or should I use another parser that accomplishes the task with no need
to replace the escaped characters?

No. The parser is working correctly.

The document you got can also be interpreted as containing another
XML document as a text. This is evil, but apparently people are doing
it, anyway. If you really want that embedded document, you need
first to extract it.

To see what I mean, do

print dataSetNode.data

The .data attribute gives you the string contents of
a text node. You could use this as an XML document, and
pass it again to an XML parser. This would be ugly,
but might be your only choice if the producer of the
document is unwilling to adjust.
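
A rough sketch of that second parse, assuming a current Python 3 and xml.dom.minidom. The sample bytes and the namespace URI are made up for illustration; only the overall shape (an escaped DataSet inside a string element) comes from the thread.

```python
from xml.dom import Node, minidom

# Hypothetical stand-in for the server response; the namespace URI is invented.
outer_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<string xmlns="http://example.invalid/ns">&lt;DataSet&gt;\n'
    b'  &lt;Order&gt;\n'
    b'    &lt;Customer&gt;439&lt;/Customer&gt;\n'
    b'  &lt;/Order&gt;\n'
    b'&lt;/DataSet&gt;</string>'
)

# First parse: the outer document, whose root holds only text.
outer = minidom.parseString(outer_bytes)
embedded = "".join(
    node.data
    for node in outer.documentElement.childNodes
    if node.nodeType == Node.TEXT_NODE
)

# Second parse: the text itself is a complete XML document.
inner = minidom.parseString(embedded)
customers = [
    element.firstChild.data
    for element in inner.getElementsByTagName("Customer")
]

print(inner.documentElement.tagName)  # DataSet
print(customers)                      # ['439']
```

No manual replace of &lt; or &gt; is needed here: the first parse already unescapes the text, so .data can be handed to the parser as it is.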

Regards,
Martin
Jul 18 '05 #12
On Thu, 20 Jan 2005 21:54:30 +0100, Martin v. Löwis wrote:
Luis P. Mendes wrote:
When I access the url via the Firefox browser and look into the source
code, I also get:

<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http.... ............">&lt;DataSet&gt;
  &lt;Order&gt;
    &lt;Customer&gt;439&lt;/Customer&gt;
  &lt;/Order&gt;
&lt;/DataSet&gt;</string>


Please do try to understand what you are seeing. This is crucial for
understanding what happens.


From extremely painful and lengthy personal experience, Luis, I
***extremely*** strongly recommend taking the time to nail this down until
you really, really, really understand what is going on. Until you can
explain it to somebody else coherently, ideally.

Mixing escaping levels like this absolutely, positively *must* be done
correctly, or extremely-painful-to-debug problems will result.

(My painful experience was layering an RPC implementation in plain text on
top of IM messages, where I was dealing with everything from the socket
level up except the XML parser. Ultimately it turned out there was a
problem in the XML parser, it rendered "&amp;amp;" as "&", which is wrong
wrong wrong. But that took a *long* time to find, especially as I had
other bugs in the way.)

Since you're layering XML in XML, test &amp;amp; and &amp;amp;amp; to make
sure they work correctly; those usually show encoding errors. And, given
your current understanding of the issue, do not write your own decoding
function unless you absolutely can't avoid it.
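
A small sketch of such a test, assuming Python 3 and xml.dom.minidom; the helper name text_of and the element name t are made up for this example. The point is that each parse should remove exactly one level of escaping, no more.

```python
from xml.dom import minidom


def text_of(xml_source):
    """Parse xml_source and return the text content of its root element."""
    root = minidom.parseString(xml_source).documentElement
    return "".join(
        node.data for node in root.childNodes if node.nodeType == node.TEXT_NODE
    )


# One parse strips one level of escaping.
one_level = text_of("<t>&amp;amp;</t>")
two_levels = text_of("<t>&amp;amp;amp;</t>")

# Parsing the result again strips the next level.
reparsed = text_of("<t>" + one_level + "</t>")

print(one_level)   # &amp;
print(two_levels)  # &amp;amp;
print(reparsed)    # &
```

A parser that returned "&" for the first case would be collapsing two levels at once, which is the kind of bug described above.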
Jul 18 '05 #13
From your experience, do you think that, if this wrong XML code is
meant to be read only by some kind of Microsoft parser, the error will
not occur?

I'll try to explain:

The XML producer writes the code on a Windows platform and 'thinks' that
every client will read/parse the code with a specific Windows parser.
Could that (wrong) XML code parse correctly in that kind of specific
Windows client?

Or in other words:

Do you know any Windows parser that could turn that erroneous encoding
into an XML tree, with four or five inner levels of tags?

I'd like to thank everyone for taking the time to answer me.
Luis
Jul 18 '05 #14
Luis P. Mendes wrote:
xml producer writes the code in Windows platform and 'thinks' that every
client will read/parse the code with a specific Windows parser. Could
that (wrong) XML code parse correctly in that kind of specific Windows
client?

not if it's an XML parser.

Do you know any windows parser that could turn that erroneous encoding
to a xml tree, with four or five inner levels of tags?


any parser *can* do that, but I doubt many parsers will do it unless
you ask it to (by extracting the string and parsing it again). here's the
elementtree version:

import urllib
from elementtree.ElementTree import parse, XML

wrapper = parse(urllib.urlopen(url))
dataset = XML(wrapper.findtext("{http://www......}string"))
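
For readers on a current Python 3, the same idea can be sketched with the standard library's xml.etree.ElementTree. This is an illustration under assumptions, not the code above: the namespace URI, the sample bytes and the second order are invented, and the sketch reads the text from the root element directly because in this sample the string element is the root. It also builds the list of dictionaries mentioned in the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the server response; the namespace URI and the
# second order are invented for illustration.
outer_bytes = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<string xmlns="http://example.invalid/ns">&lt;DataSet&gt;\n'
    b'  &lt;Order&gt;&lt;Customer&gt;439&lt;/Customer&gt;&lt;/Order&gt;\n'
    b'  &lt;Order&gt;&lt;Customer&gt;440&lt;/Customer&gt;&lt;/Order&gt;\n'
    b'&lt;/DataSet&gt;</string>'
)

# First parse: the wrapper document. Its root is the namespaced string
# element, and its .text is the embedded document, already unescaped.
wrapper = ET.fromstring(outer_bytes)

# Second parse: the embedded DataSet document.
dataset = ET.fromstring(wrapper.text)

# One dictionary per Order, mapping each child tag to its text.
orders = [
    {child.tag: child.text for child in order}
    for order in dataset.findall("Order")
]

print(wrapper.tag)  # {http://example.invalid/ns}string
print(orders)       # [{'Customer': '439'}, {'Customer': '440'}]
```

The embedded document carries no namespace of its own in this sample, so plain tag names such as "Order" are enough for the second tree.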

</F>

Jul 18 '05 #16
Luis P. Mendes wrote:
From your experience, do you think that if this wrong XML code could be
meant to be read only by somekind of Microsoft parser, the error will
not occur?


This is very unlikely. MSXML would never do this incorrectly.

Regards,
Martin
Jul 18 '05 #17