how to use structured markup tools

Sean McIlroy

I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

..
..
<a><b>x1</b><c>y1</c></a>
..
..
<a><b>x2</b><c>y2</c></a>
..
..
<a><b>x3</b><c>y3</c></a>
..
..

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

Peace,
STM

Jul 18 '05 #1


Fredrik Lundh

Sean McIlroy wrote:

I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...
.
<a><b>x1</b><c>y1</c></a>
.
<a><b>x2</b><c>y2</c></a>
.
<a><b>x3</b><c>y3</c></a>
.
...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

how about:

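# On Python 2.5 and later the same module ships in the standard
# library: from xml.etree import ElementTree
from elementtree import ElementTree

TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""

tree = ElementTree.XML(TEXT)

data = []

# collect the text of the <b> and <c> children of every <a> element
for elem in tree.findall(".//a"):
    data.append((elem.findtext("b"), elem.findtext("c")))

print data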

=> [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]
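For anyone reading this on Python 3, here is a rough sketch of the same idea using only the standard library's xml.etree.ElementTree. The sample document, the <other> element and the extract_pairs name are made up for illustration; they are not from the original post.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; <doc> and <other> are made-up elements.
TEXT = """\
<doc>
<a><b>x1</b><c>y1</c></a>
<other>ignored</other>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</doc>
"""


def extract_pairs(xml_text):
    """Return a list of (b-text, c-text) tuples, one per <a> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    # iter("a") visits every <a> element at any depth, in document order.
    for a in root.iter("a"):
        pairs.append((a.findtext("b"), a.findtext("c")))
    return pairs


data = extract_pairs(TEXT)
print(data)
```

Elements that are not <a> are simply skipped, which matches the "only type of tag I'm interested in" requirement.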

more here:

http://effbot.org/zone/element-index.htm

</F>

Jul 18 '05 #2

Sean McIlroy

Exactly what I was looking for. Thanks.

Jul 18 '05 #3

Uche Ogbuji

On Sat, 2005-03-19 at 00:14 -0800, Sean McIlroy wrote:

I'm dealing with XML files in which there are lots of tags of the
following form: <a><b>x</b><c>y</c></a> (all of these letters are being
used as 'metalinguistic variables'). Not all of the tags in the file are
of that form, but that's the only type of tag I'm interested in. (For
the insatiably curious, I'm talking about a conversation log from MSN
Messenger.) What I need to do is to pull out all the x's and y's in a
form I can use. In other words, from...

.
.
<a><b>x1</b><c>y1</c></a>
.
.
<a><b>x2</b><c>y2</c></a>
.
.
<a><b>x3</b><c>y3</c></a>
.
.

...I would like to produce, for example,...

[ (x1,y1), (x2,y2), (x3,y3) ]

Now, I'm aware that there are extensive libraries for dealing with
marked-up text, but here's the thing: I think I have a reasonable
understanding of python, but I use it in a lisplike way, and in
particular I only know the rudiments of how classes work. So here's
what I'm asking for:

Can anybody give me a rough idea how to come to grips with the problem
described above? Or even (dare to dream) example code? Any help will be
very much appreciated.

There are many tools you can use to get this done in Python. Here's a
recipe using Amara ( http://www.xml.com/pub/a/2005/01/19/amara.html )

DOC = """\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""

from amara import binderytools

matrix = []
# pushbind yields one bound object per <a> element
for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append((unicode(row.b), unicode(row.c)))

print matrix

Which outputs:

[(u'x1', u'y1'), (u'x2', u'y2'), (u'x3', u'y3')]
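If Amara is not available, a loosely comparable incremental approach can be sketched with the standard library's iterparse on Python 3. This is not Amara's API, just a stand-in under that assumption; the iter_rows helper and its tag parameter are hypothetical names.

```python
import io
import xml.etree.ElementTree as ET

# Same sample document as above, as bytes so it can be read like a file.
DOC = b"""\
<matrix>
<a><b>x1</b><c>y1</c></a>
<a><b>x2</b><c>y2</c></a>
<a><b>x3</b><c>y3</c></a>
</matrix>
"""


def iter_rows(source, tag="a"):
    """Yield a (b, c) tuple for each <tag> element once it is fully parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield (elem.findtext("b"), elem.findtext("c"))
            # Drop the element's children and text to keep memory use low.
            elem.clear()


matrix = list(iter_rows(io.BytesIO(DOC)))
print(matrix)
```

For a real log file, an open file object or a filename can be passed to iter_rows in place of the BytesIO wrapper.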

If your matrix actually has a variable or previously unknown number of
columns (e.g. <a><b>x1</b><c>y1</c><d>z1</d></a> ), the following
version of the for loop is a more general solution:

for row in binderytools.pushbind(u'a', string=DOC):
    matrix.append(tuple([ unicode(e) for e in row.xml_xpath(u'*') ]))

Same output, of course. I even tested it for you in Amara 0.9.4. And
what the heck, while I was there, I added it to the demos.
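The variable-column case can also be sketched with the Python 3 standard library instead of Amara. Iterating over an element yields its direct children, which plays roughly the role of the '*' XPath here. The sample rows below, including the <d> column, are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the first row has three columns, the second has two.
DOC = """\
<matrix>
<a><b>x1</b><c>y1</c><d>z1</d></a>
<a><b>x2</b><c>y2</c></a>
</matrix>
"""

root = ET.fromstring(DOC)
matrix = []
for row in root.iter("a"):
    # Iterating an element yields its direct children in document order.
    matrix.append(tuple(child.text for child in row))

print(matrix)
```

Each tuple is as long as its row has child elements, so no column names need to be known in advance.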

You can make things even more obfuscated^H^H^H^H^H^H^H^H^H^Hterse using
further lambda or list comp tricks, but I leave that as an exercise for
the perverse ;-)
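For the record, one possible terse form, sketched with the Python 3 standard library rather than Amara (so the calls below are ElementTree's, not binderytools'), might look like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row sample on a single line.
DOC = "<matrix><a><b>x1</b><c>y1</c></a><a><b>x2</b><c>y2</c></a></matrix>"

# One nested comprehension: one tuple per <a>, one item per child element.
matrix = [tuple(cell.text for cell in row) for row in ET.fromstring(DOC).iter("a")]
print(matrix)
```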
--
Uche Ogbuji Fourthought, Inc.
http://uche.ogbuji.net http://4Suite.org http://fourthought.com

Jul 18 '05 #4
