473,414 Members | 1,677 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,414 software developers and data experts.

how to use structured markup tools


I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

..
..
<a><b>x1</b><c>y1</c></a>
..
..
<a><b>x2</b><c>y2</c></a>
..
..
<a><b>x3</b><c>y3</c></a>
..
..

....I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

Peace,
STM

Jul 18 '05 #1
3 1229
Sean McIlroy wrote:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...
.
<a><b>x1</b><c>y1</c></a>
.
<a><b>x2</b><c>y2</c></a>
.
<a><b>x3</b><c>y3</c></a>
.
...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]


how about:

from elementtree import ElementTree

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

tree = ElementTree.XML(TEXT)

data = []

for elem in tree.findall(".//a"):
data.append((elem.findtext("b"), elem.findtext("c")))

print data

=> [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]

more here:

http://effbot.org/zone/element-index.htm

</F>

Jul 18 '05 #2
Exactly what I was looking for. Thanks.

Jul 18 '05 #3
On Sat, 2005-03-19 at 00:14 -0800, Sean McIlroy wrote:
I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables') Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

.
.
<a><b>x1</b><c>y1</c></a>
.
.
<a><b>x2</b><c>y2</c></a>
.
.
<a><b>x3</b><c>y3</c></a>
.
.

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.


There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
for row in binderytools.pushbind(u'a', string=DOC):
matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com
Use CSS to display XML, part 2 - http://www-128.ibm.com/developerwork...xmlcss2-i.html
Writing and Reading XML with XIST - http://www.xml.com/pub/a/2005/03/16/py-xml.html
Introducing the Amara XML Toolkit - http://www.xml.com/pub/a/2005/01/19/amara.ht
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
Querying WordNet as XML - http://www.ibm.com/developerworks/xm...x-think29.html
Packaging XSLT lookup tables as EXSLT functions - http://www.ibm.com/developerworks/xm...-tiplook2.html

Jul 18 '05 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

3
by: Dirk Rudolf | last post by:
I like to announce you the product X2U, avaible under http://www.lumrix.net/x2. X2U is an acronym for "XML to user". Existing XML editors still ignore the fact that users don't want to read XML...
0
by: Pomax | last post by:
Does anyone know of a good standoff markup tool that isn't Gate? I've tried to work with gate but it's very obviously not meant as a commercial or intuitive product, so I'm looking for programs...
38
by: Jukka K. Korpela | last post by:
As well all know, valid markup is important... but when trying to find a convincing modern argument in favor of this, I found pages like http://www.htmlhelp.com/tools/validator/reasons.html which...
1
by: r.shimmin | last post by:
There exist a number of related informal markup languages whose design philosophy is to use terse, easily human-entered and human-read tags, that are intended to be converted by software into some...
1
by: Trebek | last post by:
Hello grp: I have a situation I was hoping someone might be able to suggest a solution. I am retrieving html from a url and storing this information in Sql Server. Our web service supplies this...
1
by: Scott Abel | last post by:
Tony Self of HyperWrite presents an interesting and informative article entitled "Semantic, Structured Authoring: The Challenge for Technical Writers" that is sure to be of use to many technical...
0
by: kowmudi | last post by:
hi all, I am working on data conversions and Im very new to the field of XML so would be very happy if i do get a helping hand on my work.. Iam converting PDF files to an XML file and for this...
16
by: Andreas Prilop | last post by:
I have three test pages that are marked as Italian, Spanish, Portuguese, resp. by Content-Language: it <html lang="it"> <body lang="it"> and the same for "es" and "pt". Yahoo regards all...
9
by: Daniele Perilli | last post by:
Hi everybody, I'd like to introduce you a new little tool I developed to automatically check markup validation of all pages in given websites. It uses W3C HTML Validator and CSS Validator online...
1
by: nemocccc | last post by:
hello, everyone, I want to develop a software for my android phone for daily needs, any suggestions?
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
tracyyun
by: tracyyun | last post by:
Dear forum friends, With the development of smart home technology, a variety of wireless communication protocols have appeared on the market, such as Zigbee, Z-Wave, Wi-Fi, Bluetooth, etc. Each...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
by: conductexam | last post by:
I have .net C# application in which I am extracting data from word file and save it in database particularly. To store word all data as it is I am converting the whole word file firstly in HTML and...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.