
how to use structured markup tools


I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

..
..
<a><b>x1</b><c>y1</c></a>
..
..
<a><b>x2</b><c>y2</c></a>
..
..
<a><b>x3</b><c>y3</c></a>
..
..

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of Python, but I use it in a Lisp-like way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

Peace,
STM

Jul 18 '05 #1
3 Replies


Sean McIlroy wrote:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...
.
<a><b>x1</b><c>y1</c></a>
.
<a><b>x2</b><c>y2</c></a>
.
<a><b>x3</b><c>y3</c></a>
.
...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]


how about:

from elementtree import ElementTree

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

tree = ElementTree.XML(TEXT)

data = []

for elem in tree.findall(".//a"):
    data.append((elem.findtext("b"), elem.findtext("c")))

print data

=> [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]

more here:

http://effbot.org/zone/element-index.htm
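
For anyone reading this on a current Python: the snippet above is written for Python 2 and the separately installed elementtree package. A minimal sketch of the same idea, assuming Python 3 and the standard library's xml.etree.ElementTree module, might look like the following. The function name extract_pairs and the extra <other> element are only illustrative, not part of the original post.

```python
import xml.etree.ElementTree as ET

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<other>ignored</other>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""


def extract_pairs(xml_text):
    """Return a list of (b, c) text pairs, one per <a> element."""
    root = ET.fromstring(xml_text)
    data = []
    # ".//a" matches every <a> element below the root, at any depth.
    for elem in root.findall(".//a"):
        # findtext returns the text of the first matching child element.
        data.append((elem.findtext("b"), elem.findtext("c")))
    return data


pairs = extract_pairs(TEXT)
print(pairs)
```

Elements with other tag names are simply never matched by the path, which is what lets you ignore the rest of the file.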

</F>

Jul 18 '05 #2

Exactly what I was looking for. Thanks.

Jul 18 '05 #3

On Sat, 2005-03-19 at 00:14 -0800, Sean McIlroy wrote:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

.
.
<a><b>x1</b><c>y1</c></a>
.
.
<a><b>x2</b><c>y2</c></a>
.
.
<a><b>x3</b><c>y3</c></a>
.
.

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.


There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]
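
If Amara is not available, a rough analogue can be sketched with the standard library. As I understand it, pushbind hands over one matching subtree at a time instead of building the whole document first; the sketch below gets a similar effect with xml.etree.ElementTree.iterparse. It assumes Python 3, and the helper name iter_rows is hypothetical.

```python
import io
import xml.etree.ElementTree as ET

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""


def iter_rows(source):
    """Yield one (b, c) tuple per <a> element as its end tag is parsed."""
    # "end" events fire once an element and all its children are complete.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "a":
            yield (elem.findtext("b"), elem.findtext("c"))
            # Release the children we no longer need to keep memory flat.
            elem.clear()


# iterparse takes a filename or a file object; BytesIO stands in for a file.
matrix = list(iter_rows(io.BytesIO(DOC.encode("utf-8"))))
print(matrix)
```

For a real log you would pass the file name (or an open binary file) to iter_rows instead of the BytesIO wrapper.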

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.
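
The variable-width case can also be sketched without Amara. This is a rough standard-library equivalent, assuming Python 3; the function name rows_of_any_width and its row_tag parameter are made up for illustration. Iterating over an element yields its direct children, which plays the role of the '*' XPath step above.

```python
import xml.etree.ElementTree as ET

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c><d>z1</d></a>
<a><b>x2</b><c>y2</c></a>
</matrix>
"""


def rows_of_any_width(xml_text, row_tag="a"):
    """Return one tuple per row element, with one entry per child element."""
    root = ET.fromstring(xml_text)
    rows = []
    for row in root.findall(".//" + row_tag):
        # Iterating an Element yields its direct children in document order;
        # itertext() collects all text inside a child, nested or not.
        rows.append(tuple("".join(child.itertext()) for child in row))
    return rows


matrix = rows_of_any_width(DOC)
print(matrix)
```

Rows may come back with different lengths, so check the tuple size before unpacking if your data is ragged.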

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)
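
For the curious, here is one such terser form, written against the standard library's xml.etree.ElementTree rather than Amara and assuming Python 3. It also skips any <a> element that lacks a <b> or <c> child, on the assumption that such elements are not of the form the original question cares about. The sample data in LOG is invented for the example.

```python
import xml.etree.ElementTree as ET

LOG = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>only-b</b></a>
<z>something else</z>
<a><b>x2</b><c>y2</c></a>
</doc>
"""

root = ET.fromstring(LOG)

# One comprehension: visit every <a>, keep it only if both children exist.
pairs = [
    (a.findtext("b"), a.findtext("c"))
    for a in root.iter("a")
    if a.find("b") is not None and a.find("c") is not None
]
print(pairs)
```

The explicit "is not None" tests matter: an Element with no children of its own is falsy, so a bare "if a.find('b')" would wrongly drop valid rows.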
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

Jul 18 '05 #4
