Getting elements and text with lxml

Hello,

I have an XML file that starts with:

<vortaro>
<art mrk="$Id: a.xml,v 1.10 2007/09/11 16:30:20 revo Exp $">
<kap>
<ofc>*</ofc>-<rad>a</rad>
</kap>

out of it, I'd like to extract something like (I'm just showing one
structure, any structure as long as all data is there is fine):

[("ofc", "*"), "-", ("rad", "a")]

How can I do it? I managed to get the content of both tags and the
text up to the first tag ("\n "), but not the - (and in other XML
files, there's more text outside the elements).

Thanks.
Jun 27 '08 #1
On Fri, 16 May 2008 18:53:03 -0300, J. Pablo Fernández <pu****@pupeno.com>
wrote:
> How can I do it? I managed to get the content of both tags and the
> text up to the first tag ("\n "), but not the - (and in other XML
> files, there's more text outside the elements).

Look for the "tail" attribute.
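
As a small illustration of that hint: in the ElementTree model that lxml follows, the text right after an element's closing tag is stored on that element as `tail`, so the "-" belongs to the `ofc` child rather than to `kap`. The sketch below assumes the `kap` fragment from the question is parsed on its own (the `XML` string is a stand-alone copy made for the demo) and falls back to the standard library's ElementTree when lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module exposes the same text/tail attributes.
    import xml.etree.ElementTree as etree

# Hypothetical stand-alone copy of the fragment from the question.
XML = "<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>"

kap = etree.fromstring(XML)
ofc, rad = kap  # iterating an element yields its direct children

print(repr(kap.text))  # whitespace before <ofc>
print(repr(ofc.text))  # content of <ofc>
print(repr(ofc.tail))  # text between </ofc> and <rad>
print(repr(rad.text))  # content of <rad>
print(repr(rad.tail))  # whitespace between </rad> and </kap>
```

Checking `tail` on the children, not on `kap` itself, is what recovers the text in the middle.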

--
Gabriel Genellina

Jun 27 '08 #2
On May 17, 2:19 am, "Gabriel Genellina" <gagsl-...@yahoo.com.ar>
wrote:
> Look for the "tail" attribute.

That gives me the last part, but not the one in the middle:

In : etree.tounicode(e)
Out: u'<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>\n'

In : e.text
Out: '\n '

In : e.tail
Out: '\n'

Thanks.
Jun 27 '08 #3
J. Pablo Fernández wrote:
> That gives me the last part, but not the one in the middle:
>
> In : etree.tounicode(e)
> Out: u'<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>\n'
>
> In : e.text
> Out: '\n '
>
> In : e.tail
> Out: '\n'

You need the text content of your initial element's children, which
needs that of their children, and so on.
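
One way to read this advice, as a rough sketch rather than the code from the linked page: walk the children, record each child's tag and text, and pick up the `tail` that follows it, recursing when a child has children of its own. The helper name `flatten`, the `XML` sample string, and the choice to drop whitespace-only text are all assumptions made for this example.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module behaves the same for this simple case.
    import xml.etree.ElementTree as etree

# Hypothetical stand-alone copy of the fragment from the question.
XML = "<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>"


def flatten(el):
    """Return el's content as (tag, text) pairs interleaved with loose text."""
    parts = []
    if el.text and el.text.strip():      # text before the first child
        parts.append(el.text)
    for child in el:
        if len(child):                   # child has children: recurse
            parts.append((child.tag, flatten(child)))
        else:
            parts.append((child.tag, child.text))
        if child.tail and child.tail.strip():  # text after the child
            parts.append(child.tail)
    return parts


result = flatten(etree.fromstring(XML))
print(result)
```

For the sample fragment this yields the `("ofc", "*")`, `"-"`, `("rad", "a")` sequence asked for in the first post. If surrounding whitespace matters, the two `strip()` checks can be removed.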

See http://effbot.org/zone/element-bits-and-pieces.htm

HTH,
John
Jun 27 '08 #4
J. Pablo Fernández wrote:
> out of it, I'd like to extract something like (I'm just showing one
> structure, any structure as long as all data is there is fine):
>
> [("ofc", "*"), "-", ("rad", "a")]

    >>> root = etree.fromstring(xml)
    >>> l = []
    >>> for el in root.iter():    # or root.getiterator()
    ...     l.append((el, el.text))   # the element and the text inside it
    ...     l.append(el.tail)         # the text that follows its closing tag

or maybe this is enough:

list(root.itertext())
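
To make the trade-off of this shortcut concrete, here is a minimal sketch, assuming an lxml (or standard library ElementTree) version that provides `itertext()`. It returns every text chunk in document order, including the "-" between the tags, but the tag names are lost. The `XML` string is a stand-alone copy of the fragment made up for the demo.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module offers the same itertext() method.
    import xml.etree.ElementTree as etree

# Hypothetical stand-alone copy of the fragment from the question.
XML = "<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>"

root = etree.fromstring(XML)

# Every text and tail chunk below root, in document order.
pieces = list(root.itertext())
print(pieces)

# Joined and stripped, this is the visible text of the entry.
joined = "".join(pieces).strip()
print(joined)
```

This is enough when only the text matters. When the result has to say which tag each piece came from, the loop over `text` and `tail` is the better fit.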

Stefan
Jun 27 '08 #5
On May 17, 4:17 pm, Stefan Behnel <stefan...@behnel.de> wrote:
> >>> root = etree.fromstring(xml)
> >>> l = []
> >>> for el in root.iter():    # or root.getiterator()
> ...     l.append((el, el.text))
> ...     l.append(el.tail)
>
> or maybe this is enough:
>
>     list(root.itertext())

Hello,

My object doesn't have iter() or itertext(); it only has
iterancestors, iterchildren, iterdescendants and itersiblings.
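
A missing `iter()` and `itertext()` probably points to an older lxml release, though that is a guess from the method list. In that situation the `getiterator()` spelling mentioned in the comment above may be available, or the traversal can be written by hand using only `text`, `tail` and plain iteration over children. The generator below is such a hand-rolled sketch; the name `iter_text` and the `XML` sample are assumptions for illustration, and `yield from` needs Python 3.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib module behaves the same for this simple case.
    import xml.etree.ElementTree as etree

# Hypothetical stand-alone copy of the fragment from the question.
XML = "<kap>\n <ofc>*</ofc>-<rad>a</rad>\n</kap>"


def iter_text(el):
    """Yield the text chunks inside el in document order."""
    if el.text:
        yield el.text
    for child in el:                 # direct children only
        yield from iter_text(child)  # everything inside the child
        if child.tail:
            yield child.tail         # text after the child's closing tag


root = etree.fromstring(XML)
chunks = list(iter_text(root))
print(chunks)
```

The tail of `el` itself is deliberately left out, since it lies outside the element being walked.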

Thanks.
Jun 27 '08 #6
