XML parsing per record

Willem Ligtenberg

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.

Jul 19 '05 #1

Irmen de Jong

Willem Ligtenberg wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/P...AX-and-Python/
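
For illustration, a SAX handler along these lines could pull one field out of each record without holding the whole file in memory. This is only a minimal sketch, assuming Python 3's standard xml.sax module; the GeneIdHandler class and the inline SAMPLE document are made up for the example (the tag names follow the Entrezgene sample posted later in this thread), and a real run would pass the path of the large file instead of the in-memory stream.

```python
import io
import xml.sax


class GeneIdHandler(xml.sax.ContentHandler):
    """Hypothetical handler: collects the text of every <Gene-track_geneid>."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # None means "not inside a geneid element"

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver the text of one element in several pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid":
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None


# Tiny stand-in for the 2.4 gig file (assumption: same tag names).
SAMPLE = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

handler = GeneIdHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.gene_ids)
```

The handler only keeps the current text buffer and the list of results, so memory use does not grow with the size of the input beyond what you choose to collect.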

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency....

--Irmen

Jul 19 '05 #2

Ivan Voras

Irmen de Jong wrote:

XML is not known for its efficiency....

<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of
toilet paper... oh the glorious day...

Jul 19 '05 #3

Kent Johnson

Willem Ligtenberg wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_...ment-generator

Kent

Jul 19 '05 #4

Fredrik Lundh

Kent Johnson wrote:

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.

You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_...ment-generator

if you have ElementTree 1.2.5 or later, the "iterparse" function provides a
more efficient implementation of that pattern:

http://effbot.org/zone/element-iterparse.htm

the cElementTree implementation of "iterparse" is a lot faster than SAX; see
the second table under

http://effbot.org/zone/celementtree.htm#benchmarks

for some figures.

</F>

Jul 19 '05 #5

William Park

Willem Ligtenberg <WL*********@gmail.com> wrote:

I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.

You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)
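
Feeding the parser piece by piece could look roughly like the following. This is a sketch, assuming Python 3 and the standard xml.parsers.expat module; the handler functions, the state dictionary, the 16-byte chunk size and the inline SAMPLE document are all invented for illustration, and a real program would read much larger chunks from the actual file.

```python
import io
import xml.parsers.expat

# Tiny stand-in for the real file (assumption: same tag names as the thread).
SAMPLE = b"""<Entrezgene-Set>
<Entrezgene><Gene-track_geneid>9996</Gene-track_geneid></Entrezgene>
<Entrezgene><Gene-track_geneid>320632</Gene-track_geneid></Entrezgene>
</Entrezgene-Set>"""

gene_ids = []
state = {"inside": False, "text": []}


def start(name, attrs):
    if name == "Gene-track_geneid":
        state["inside"] = True
        state["text"] = []


def chars(data):
    # Text can arrive in several pieces, especially across chunk borders.
    if state["inside"]:
        state["text"].append(data)


def end(name):
    if name == "Gene-track_geneid":
        gene_ids.append("".join(state["text"]))
        state["inside"] = False


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.CharacterDataHandler = chars
parser.EndElementHandler = end

stream = io.BytesIO(SAMPLE)
while True:
    chunk = stream.read(16)  # deliberately tiny to show incremental feeding
    if not chunk:
        parser.Parse(b"", True)  # tell expat the input is complete
        break
    parser.Parse(chunk, False)

print(gene_ids)
```

Chunks may end in the middle of a tag or a text node; expat buffers what it needs, so the callbacks still see whole element names.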

Care to post more details?

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.

Jul 19 '05 #6

Fredrik Lundh

William Park wrote:

You may want to try Expat (www.libexpat.org) or the Python wrapper for it.

Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat. (see my other post for links to benchmarks and code).

</F>

Jul 19 '05 #7

Willem Ligtenberg

On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:

Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.

You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)

Care to post more details?

The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
<Gene-track_status value="secondary">1</Gene-track_status>
<Gene-track_current-id>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GeneID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Gene-track_current-id>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>8</Date-std_month>
<Date-std_day>28</Date-std_day>
<Date-std_hour>21</Date-std_hour>
<Date-std_minute>39</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2005</Date-std_year>
<Date-std_month>2</Date-std_month>
<Date-std_day>17</Date-std_day>
<Date-std_hour>12</Date-std_hour>
<Date-std_minute>54</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Mus musculus</Org-ref_taxname>
<Org-ref_common>house mouse</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>10090</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_syn>
<Org-ref_syn_E>mouse</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Mus</BinomialOrgName_genus>
<BinomialOrgName_species>musculus</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>ROD</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>9996</Gene-source_src-int>
<Gene-source_src-str2>9996</Gene-source_src-str2>
<Gene-source_gene-display value="false"/>
<Gene-source_locus-display value="false"/>
<Gene-source_extra-terms value="false"/>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_version>0</Gene-commentary_version>
</Gene-commentary>
</Entrezgene_locus>
<Entrezgene_unique-keys>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>9996</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Entrezgene_unique-keys>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>

Jul 19 '05 #8

Kent Johnson

Willem Ligtenberg wrote:

Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<snip>
</Entrezgene>
</Entrezgene-Set>

This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
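
For readers on a current Python, the same pattern can be sketched with the standard library, where ElementTree lives in xml.etree.ElementTree. This is a hedged variant rather than the code posted above: the iter_gene_ids function and the inline SAMPLE are hypothetical, and the extra root.clear() is an assumption-driven tweak so that finished records are also detached from the root element, which matters for a file of this size.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for 'Entrezgene.xml' (assumption: same structure).
SAMPLE = b"""<Entrezgene-Set>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
<Entrezgene><Entrezgene_track-info><Gene-track>
<Gene-track_geneid>320632</Gene-track_geneid>
</Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""


def iter_gene_ids(source):
    """Yield the gene id of each <Entrezgene> record, one at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == "Entrezgene":
            yield elem.findtext(
                "Entrezgene_track-info/Gene-track/Gene-track_geneid")
            # Drop the finished record from the root so memory stays flat.
            root.clear()


gene_ids = list(iter_gene_ids(io.BytesIO(SAMPLE)))
print(gene_ids)
```

With a real file, pass its path (or an open binary file object) to iter_gene_ids instead of the BytesIO stream.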

Kent

Jul 19 '05 #9

William Park

Willem Ligtenberg <WL*********@gmail.com> wrote:

On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
Care to post more details?
The XML file I need to parse contains information about genes.
So the first element is a gene and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.

You have to help us a little more here. Which info do you want to
extract from the example below?
<Entrezgene-Set>
...
</Entrezgene-Set>

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.

Jul 19 '05 #10

Willem Ligtenberg

This is all the info I need from the xml file:
ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase --> <Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>126957426</Seq-interval_from>
<Seq-interval_to>126989473</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Seq-id_gi>51860766</Seq-id_gi>
</Seq-id>
</Seq-interval_id>
</Seq-interval>
</Seq-loc_int>
</Seq-loc>
</Gene-commentary_seqs>
Endbase

Function --> <Prot-ref_name>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:2444401</Gene-ref_locus-tag>
<Gene-commentary_source>
<Other-source>
<Other-source_src>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>5524</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Other-source_src>
<Other-source_anchor>ATP binding</Other-source_anchor>
<Other-source_post-text>evidence: ISS</Other-source_post-text>
</Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like 1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
<Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome: <SubSource>
<SubSource_subtype value="chromosome">1</SubSource_subtype>
<SubSource_name>6</SubSource_name>
</SubSource>

Some can happen more than once in a record.

Jul 19 '05 #11

Willem Ligtenberg

As I'm trying to write the code using cElementTree,
I stumbled across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

Thanks in advance,

Willem Ligtenberg

Jul 19 '05 #12

Willem Ligtenberg

By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

Jul 19 '05 #13

Fredrik Lundh

Willem Ligtenberg wrote:

As I'm trying to write the code using cElementTree,
I stumbled across one problem. Sometimes there are multiple values to
retrieve from one record for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do you get not only the first, but the rest as well, so that I can
store them in a list?

findall returns a list of matching elements. if "elem" is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]

(you have read the elementtree documentation, I hope?)
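
As a small self-contained illustration of that list comprehension, here is a sketch assuming Python 3's xml.etree.ElementTree. The SAMPLE fragment reuses the element names from this thread; the outer <Prot-ref> wrapper and the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the <Prot-ref> wrapper is assumed for the example.
SAMPLE = """<Prot-ref>
<Prot-ref_name>
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
</Prot-ref_name>
</Prot-ref>"""

record = ET.fromstring(SAMPLE)

# "elem" is the direct parent of the Prot-ref_name_E elements.
elem = record.find("Prot-ref_name")
names = [e.text for e in elem.findall("Prot-ref_name_E")]

# Starting higher up, ".//" searches all descendants instead.
names_from_record = [e.text for e in record.findall(".//Prot-ref_name_E")]

print(names)
```

findall returns Element objects, so the text has to be read from each one through its .text attribute.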

</F>

Jul 19 '05 #14

Willem Ligtenberg

As you can read in my other post, my problem was with iterating through
the list. I didn't know that you should use e.text; I only did print e,
not print e.text.
I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

Jul 19 '05 #15

Fredrik Lundh

Willem Ligtenberg wrote:

By the way, I know about findall, but when I iterate through it like:
for x in function:
    print 'function', x

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

for x in function:
    print 'function', x.text

</F>

Jul 19 '05 #16

William Park

Willem Ligtenberg <WL*********@gmail.com> wrote:
....

ID --> <Gene-track_geneid>320632</Gene-track_geneid>
....
Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>
....
EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>
....
Some can happen more than once in a record.

Since all your data are contained in unique tags on individual lines,
you can tackle this in so many different ways. Okay, that's your input
format. What is your output format?

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.

Jul 19 '05 #17

Willem Ligtenberg

Is there an easy way to couple data together? I have discovered an
irritating feature in the XML file.
Sometimes this is a database reference:
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_str>1234</Object-id_str>
</Object-id>
</Dbtag_tag>
</Dbtag>

And sometimes:

<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>1234</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>

So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

Jul 19 '05 #18

Fredrik Lundh

Willem Ligtenberg wrote:

So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?

why not just check for both alternatives?

text = elem.findtext("Object-id_str")
if text is None:
    text = elem.findtext("Object-id_id")

(or you can loop over the child elements and map elem.tag through a
dictionary...)

If not, I still might need to revert to SAX... :(

you still have to check for both alternatives...

(if you find a parsing problem that you cannot solve with a light-weight
DOM, SAX won't help you...)

</F>

Jul 19 '05 #19

Willem Ligtenberg

Since more than one database reference is possible per record, you
get per record a list of database names, database strings and
database IDs (where the strings and the IDs are really the same thing...).
So per record you check for both alternatives, but since there could be
more than one, you do findall and get an (unsorted) list back. And now you
don't know which ID belonged to which database...
See my problem?

Cheers,

Willem

Jul 19 '05 #20

Fredrik Lundh

Willem Ligtenberg wrote:

Since more than one database reference is possible per record, you
get per record a list of database names, database strings and
database IDs (where the strings and the IDs are really the same thing...).
So per record you check for both alternatives, but since there could be
more than one, you do findall and get an (unsorted) list back.

findall returns matching elements in document order.

And now you don't know which ID belonged to which database...

why not? by looking at each database separately, surely you must be
able to figure out if the subelement holds an ID or a string? sure, if you
do document.findall(".//Object-id_id"), you'll get all IDs in document
order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
elements, and can then look inside them to see what they contain.

See my problem?

I'm afraid not. the document seems to have a clear structure; for some
reason, you don't seem to take that into account in your program.
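
Walking the Dbtag elements one at a time, as described above, keeps each database name paired with its own ID. The following is a minimal sketch, assuming Python 3's xml.etree.ElementTree; the db_links function, the list-valued dictionary and the inline SAMPLE record are made up for illustration, with tag names taken from the fragments posted in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical record with one string-style and one id-style reference.
SAMPLE = """<Entrezgene>
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag><Object-id><Object-id_str>1234</Object-id_str></Object-id></Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag><Object-id><Object-id_id>5524</Object-id_id></Object-id></Dbtag_tag>
</Dbtag>
</Entrezgene>"""


def db_links(record):
    """Map each database name to the list of IDs found for it in a record."""
    links = {}
    for dbtag in record.findall(".//Dbtag"):
        name = dbtag.findtext("Dbtag_db")
        # Check both alternatives inside this one Dbtag element.
        ident = dbtag.findtext("Dbtag_tag/Object-id/Object-id_str")
        if ident is None:
            ident = dbtag.findtext("Dbtag_tag/Object-id/Object-id_id")
        # A database may be referenced more than once per record.
        links.setdefault(name, []).append(ident)
    return links


links = db_links(ET.fromstring(SAMPLE))
print(links)
```

Because name and ID are read from the same Dbtag element, there is never a second, unrelated list to line up afterwards.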

</F>

Jul 19 '05 #21

Fredrik Lundh

order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
elements

make that "you get a list of all Dbtag elements in that record"

</F>

Jul 19 '05 #22

Kent Johnson

Willem Ligtenberg wrote:

Is there an easy way to couple data together? I have discovered an
irritating feature in the XML file.
Sometimes this is a database reference:
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_str>1234</Object-id_str>
</Object-id>
</Dbtag_tag>
</Dbtag>

And sometimes:

<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>1234</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>

So I get a list of database names and two (!) lists of IDs,
and those two are in no way related. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

None of your requirements sound particularly difficult to implement. If you would post a complete
example of the data you want to parse and the data you would like to end up with, it would be easier to
help you. The sample data you posted originally does not have many of the fields you want to extract,
and your example of what you want to end up with is not too clear either.

If you are having trouble with ElementTree, I expect you will be completely lost with SAX;
ElementTree is much easier to work with and cElementTree is very fast.

Kent

Jul 19 '05 #23
