XML parsing per record

I want to parse a very large (2.4 GB) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most approaches I have seen read the entire
XML file at once, which isn't going to work here.

So I would like to parse the XML file one record at a time and store
the information from each record in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to Python, by the way.
Jul 19 '05 #1
Willem Ligtenberg wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics of course :))
> But I have no clue how to do that. Most things I see read the entire xml
> file at once. That isn't going to work here of course.
>
> So I would like to parse an XML file one record at a time and then be able
> to store the information in another object.
> How should I do that?
>
> Thanks in advance,
>
> Willem Ligtenberg
> A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/P...AX-and-Python/
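
For readers who want a concrete starting point, here is a minimal SAX sketch using the standard library's xml.sax module (Python 3). It is not taken from the linked article. The handler class, the helper function and the choice of Gene-track_geneid as the target (an element name from the sample posted later in this thread) are assumptions made for illustration only.

```python
import io
import xml.sax


class GeneIdHandler(xml.sax.ContentHandler):
    """Collect the text of every <Gene-track_geneid> element (hypothetical handler)."""

    TARGET = "Gene-track_geneid"

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._parts = None  # None outside the target element, a list inside it

    def startElement(self, name, attrs):
        if name == self.TARGET:
            self._parts = []

    def characters(self, content):
        # SAX may deliver one text node in several calls, so accumulate pieces.
        if self._parts is not None:
            self._parts.append(content)

    def endElement(self, name):
        if name == self.TARGET:
            self.gene_ids.append("".join(self._parts))
            self._parts = None


def gene_ids_from(source):
    """Parse a file name or binary file object and return the collected ids."""
    handler = GeneIdHandler()
    xml.sax.parse(source, handler)
    return handler.gene_ids


# Tiny in-memory stand-in for the real 2.4 GB file.
sample = io.BytesIO(
    b"<Entrezgene-Set>"
    b"<Entrezgene><Gene-track_geneid>9996</Gene-track_geneid></Entrezgene>"
    b"<Entrezgene><Gene-track_geneid>320632</Gene-track_geneid></Entrezgene>"
    b"</Entrezgene-Set>"
)
print(gene_ids_from(sample))
```

The parser never holds more than the current event in memory, which is why SAX copes with very large files; the price is that the handler has to track its own state.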

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency....

--Irmen
Jul 19 '05 #2
Irmen de Jong wrote:
> XML is not known for its efficiency....


<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of
toilet paper... oh the glorious day...

Jul 19 '05 #3
Willem Ligtenberg wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics of course :))
> But I have no clue how to do that. Most things I see read the entire xml
> file at once. That isn't going to work here of course.
>
> So I would like to parse an XML file one record at a time and then be able
> to store the information in another object.
> How should I do that?


You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_...ment-generator

Kent
Jul 19 '05 #4
Kent Johnson wrote:
> > So I would like to parse an XML file one record at a time and then be able
> > to store the information in another object.
>
> You might be interested in this recipe using ElementTree:
> http://online.effbot.org/2004_12_01_...ment-generator


if you have ElementTree 1.2.5 or later, the "iterparse" function provides a
more efficient implementation of that pattern:

http://effbot.org/zone/element-iterparse.htm

the cElementTree implementation of "iterparse" is a lot faster than SAX; see
the second table under

http://effbot.org/zone/celementtree.htm#benchmarks

for some figures.

</F>

Jul 19 '05 #5
Willem Ligtenberg <WL*********@gmail.com> wrote:
> I want to parse a very large (2.4 gig) XML file (bioinformatics
> of course :)) But I have no clue how to do that. Most things I see read
> the entire xml file at once. That isn't going to work here of course.
>
> So I would like to parse an XML file one record at a time and then be
> able to store the information in another object. How should I do
> that?
>
> Thanks in advance,
>
> Willem Ligtenberg
> A total newbie to python by the way.


You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say line by line or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)
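
A minimal sketch of feeding Expat one piece at a time, using the standard library's xml.parsers.expat module (Python 3). The function name collect_text, its parameters and the toy document are hypothetical and only illustrate the incremental Parse() calls; the sketch also assumes the target element is never nested inside itself.

```python
import io
import xml.parsers.expat


def collect_text(stream, tag, chunk_size=64 * 1024):
    """Feed `stream` to expat in chunks; return the text of every `tag` element."""
    results = []
    parts = None  # None outside `tag`, a list of text pieces inside it

    def start(name, attrs):
        nonlocal parts
        if name == tag:
            parts = []

    def char_data(data):
        # Text can arrive in several pieces, especially across chunk boundaries.
        if parts is not None:
            parts.append(data)

    def end(name):
        nonlocal parts
        if name == tag:
            results.append("".join(parts))
            parts = None

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = char_data
    parser.EndElementHandler = end

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # False: more data will follow
    parser.Parse(b"", True)         # True: this was the end of the document
    return results


# A deliberately tiny chunk size to show that tokens may span chunks.
doc = io.BytesIO(b"<a><id>12</id><x/><id>345</id></a>")
print(collect_text(doc, "id", chunk_size=5))
```

Only one chunk plus the parser's own state is in memory at any time, so the same loop should work for a multi-gigabyte file opened in binary mode.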

Care to post more details?

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.

Jul 19 '05 #6
William Park wrote:
> You may want to try Expat (www.libexpat.org) or Python wrapper to it.


Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat. (see my other post for links to benchmarks and code).

</F>

Jul 19 '05 #7
On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
> Willem Ligtenberg <WL*********@gmail.com> wrote:
> > I want to parse a very large (2.4 gig) XML file (bioinformatics
> > of course :)) But I have no clue how to do that. Most things I see read
> > the entire xml file at once. That isn't going to work here of course.
> >
> > So I would like to parse an XML file one record at a time and then be
> > able to store the information in another object. How should I do
> > that?
> >
> > Thanks in advance,
> >
> > Willem Ligtenberg
> > A total newbie to python by the way.
>
> You may want to try Expat (www.libexpat.org) or Python wrapper to it.
> You can feed small piece at a time, say by lines or whatever. Of
> course, it all depends on what kind of parsing you have in mind. :-)
>
> Care to post more details?


The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
<Gene-track_status value="secondary">1</Gene-track_status>
<Gene-track_current-id>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GeneID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Gene-track_current-id>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>8</Date-std_month>
<Date-std_day>28</Date-std_day>
<Date-std_hour>21</Date-std_hour>
<Date-std_minute>39</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2005</Date-std_year>
<Date-std_month>2</Date-std_month>
<Date-std_day>17</Date-std_day>
<Date-std_hour>12</Date-std_hour>
<Date-std_minute>54</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Mus musculus</Org-ref_taxname>
<Org-ref_common>house mouse</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>10090</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_syn>
<Org-ref_syn_E>mouse</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Mus</BinomialOrgName_genus>
<BinomialOrgName_species>musculus</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>ROD</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>9996</Gene-source_src-int>
<Gene-source_src-str2>9996</Gene-source_src-str2>
<Gene-source_gene-display value="false"/>
<Gene-source_locus-display value="false"/>
<Gene-source_extra-terms value="false"/>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_version>0</Gene-commentary_version>
</Gene-commentary>
</Entrezgene_locus>
<Entrezgene_unique-keys>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>9996</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Entrezgene_unique-keys>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>

Jul 19 '05 #8
Willem Ligtenberg wrote:
> > Willem Ligtenberg <WL*********@gmail.com> wrote:
> > > I want to parse a very large (2.4 gig) XML file (bioinformatics
> > > of course :)) But I have no clue how to do that. Most things I see read
> > > the entire xml file at once. That isn't going to work here of course.
> > >
> > > So I would like to parse an XML file one record at a time and then be
> > > able to store the information in another object. How should I do
> > > that?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene, and then there are a lot of sub-elements with
> sub-elements. I only need some of the information and want to store it in
> an object called gene. Later on this information will be printed into a
> file, which in its turn will be fed into some other program.
> This is an example of the XML:
> <?xml version="1.0"?>
> <!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
> <Entrezgene-Set>
> <Entrezgene>
> <snip>
> </Entrezgene>
> </Entrezgene-Set>


This should get you started with cElementTree:

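try:
    import cElementTree as ElementTree  # standalone package, as used in 2005
except ImportError:
    from xml.etree import ElementTree   # same API in the standard library

source = 'Entrezgene.xml'

# With no "events" argument, iterparse reports each element when it ends,
# i.e. when the element and all of its children have been read.
for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print('Gene id %s' % geneid)

        # Throw away the element, we're done with it
        elem.clear()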
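
One thing to watch with a file this size: elem.clear() empties each record, but the emptied elements stay attached to the root element. A common variation grabs the root from the first "start" event and clears it after every record. The sketch below assumes Python 3 and the standard library's xml.etree.ElementTree; the generator name iter_records and the inline sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag="Entrezgene"):
    """Yield each complete record element, then release it from the tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            # Runs when the caller asks for the next record: drop the
            # finished record (and anything else) from the root.
            root.clear()


# Two-record stand-in for the real file; pass a file name for real data.
sample = io.BytesIO(
    b"<Entrezgene-Set>"
    b"<Entrezgene><Entrezgene_track-info><Gene-track>"
    b"<Gene-track_geneid>9996</Gene-track_geneid>"
    b"</Gene-track></Entrezgene_track-info></Entrezgene>"
    b"<Entrezgene><Entrezgene_track-info><Gene-track>"
    b"<Gene-track_geneid>320632</Gene-track_geneid>"
    b"</Gene-track></Entrezgene_track-info></Entrezgene>"
    b"</Entrezgene-Set>"
)

path = "Entrezgene_track-info/Gene-track/Gene-track_geneid"
for record in iter_records(sample):
    print("Gene id", record.findtext(path))
```

A record must be fully processed before the loop moves on, because it is emptied as soon as the generator resumes.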

Kent
Jul 19 '05 #9
Willem Ligtenberg <WL*********@gmail.com> wrote:
> On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
> > Care to post more details?
>
> The XML file I need to parse contains information about genes.
> So the first element is a gene, and then there are a lot of sub-elements with
> sub-elements. I only need some of the information and want to store it in
> an object called gene. Later on this information will be printed into a
> file, which in its turn will be fed into some other program.


You have to help us a little more here. Which info do you want to
extract from the example below?
<Entrezgene-Set>
...
</Entrezgene-Set>


--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.
Jul 19 '05 #10
This is all the info I need from the xml file:
ID --> <Gene-track_geneid>320632</Gene-track_geneid>

Name --> <Gene-ref>
<Gene-ref_locus>Pzp</Gene-ref_locus>

Startbase --> <Gene-commentary_seqs>
<Seq-loc>
<Seq-loc_int>
<Seq-interval>
<Seq-interval_from>126957426</Seq-interval_from>
<Seq-interval_to>126989473</Seq-interval_to>
<Seq-interval_strand>
<Na-strand value="plus"/>
</Seq-interval_strand>
<Seq-interval_id>
<Seq-id>
<Seq-id_gi>51860766</Seq-id_gi>
</Seq-id>
</Seq-interval_id>
</Seq-interval>
</Seq-loc_int>
</Seq-loc>
</Gene-commentary_seqs>
Endbase

Function --> <Prot-ref_name>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa</Prot-ref_name_E>
<Prot-ref_name_E>U5 snRNP-specific protein, 200 kDa (DEXH RNA helicase
family)</Prot-ref_name_E>
</Prot-ref_name>

DBLink --> <Gene-ref_locus-tag>MGI:2444401</Gene-ref_locus-tag>
<Gene-commentary_source>
<Other-source>
<Other-source_src>
<Dbtag>
<Dbtag_db>GO</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>5524</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Other-source_src>
<Other-source_anchor>ATP binding</Other-source_anchor>
<Other-source_post-text>evidence: ISS</Other-source_post-text>
</Other-source>
</Gene-commentary_source>

Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>

gene-comment --> <Gene-ref_desc>activating signal cointegrator 1 complex subunit 3-like
1</Gene-ref_desc>

synonym --> <Gene-ref_syn>
<Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>
<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200KD</Gene-ref_syn_E>
<Gene-ref_syn_E>U5-200-KD</Gene-ref_syn_E>
<Gene-ref_syn_E>A330064G03Rik</Gene-ref_syn_E>
</Gene-ref_syn>

EC --> <Prot-ref_ec>
<Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
<Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
</Prot-ref_ec>

Chromosome: <SubSource>
<SubSource_subtype value="chromosome">1</SubSource_subtype>
<SubSource_name>6</SubSource_name>
</SubSource>

Some can happen more than once in a record.
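
As a rough sketch of what "an object called gene" could look like, here is a small Python 3 dataclass that holds a few of the fields listed above, with lists for the ones that can occur more than once. The class, its attribute names and the helper gene_from_record are hypothetical. The ".//" searches are used because the full paths to these elements are not shown in this thread, and the test record simply combines values from the examples above for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gene:
    """Hypothetical container for a subset of the wanted fields."""
    gene_id: Optional[str] = None
    name: Optional[str] = None
    synonyms: List[str] = field(default_factory=list)
    ec_numbers: List[str] = field(default_factory=list)


def gene_from_record(record):
    """Build a Gene from one <Entrezgene> element."""
    return Gene(
        # findtext returns the first match, or None if there is none.
        gene_id=record.findtext(".//Gene-track_geneid"),
        name=record.findtext(".//Gene-ref_locus"),
        # Repeating elements are collected in document order.
        synonyms=[e.text for e in record.iter("Gene-ref_syn_E")],
        ec_numbers=[e.text for e in record.iter("Prot-ref_ec_E")],
    )


# Made-up fragment: element names come from the list above, the nesting
# and the combination of values are assumptions.
record = ET.fromstring(
    "<Entrezgene>"
    "<Gene-track_geneid>320632</Gene-track_geneid>"
    "<Gene-ref><Gene-ref_locus>Pzp</Gene-ref_locus>"
    "<Gene-ref_syn><Gene-ref_syn_E>HELIC2</Gene-ref_syn_E>"
    "<Gene-ref_syn_E>KIAA0788</Gene-ref_syn_E></Gene-ref_syn></Gene-ref>"
    "</Entrezgene>"
)
gene = gene_from_record(record)
print(gene)
```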
On Fri, 22 Apr 2005 02:41:46 -0400, William Park wrote:
<snip>

Jul 19 '05 #11
While trying to write the code using cElementTree, I stumbled across one
problem. Sometimes there are multiple values to retrieve from one record
for the same element. Like this:
<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>

How do I get not only the first but the rest as well, so that I can
store them in a list?

Thanks in advance,

Willem Ligtenberg

On Fri, 22 Apr 2005 13:48:15 +0200, Willem Ligtenberg wrote:
<snip>

Jul 19 '05 #12
By the way, I know about findall, but when I iterate through it like:
for x in function:
    print('function %s' % x)

I get:
function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
function <Element 'Prot-ref_name_E' at 0xb7d10d10>

But of course I want the information in there...

On Fri, 22 Apr 2005 15:22:17 +0200, Willem Ligtenberg wrote:
<snip>

Jul 19 '05 #13
Willem Ligtenberg wrote:
> While trying to write the code using cElementTree, I stumbled across one
> problem. Sometimes there are multiple values to retrieve from one record
> for the same element. Like this:
> <Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>
> <Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>
>
> How do I get not only the first but the rest as well, so that I can
> store them in a list?


findall returns a list of matching elements. if "elem" is the parent element,
this gives you a list of the text inside all Prot-ref_name_E child elements:

[e.text for e in elem.findall("Prot-ref_name_E")]
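
A small self-contained illustration of that expression, assuming Python 3 and the standard library's xml.etree.ElementTree. The variable names are hypothetical; the Prot-ref_name fragment is the one posted earlier in this thread. Note that a bare tag name only matches direct children, so from further up the tree you need a ".//" prefix.

```python
import xml.etree.ElementTree as ET

parent = ET.fromstring(
    "<Prot-ref_name>"
    "<Prot-ref_name_E>ATP-binding cassette, subfamily G, member 1</Prot-ref_name_E>"
    "<Prot-ref_name_E>ATP-binding cassette 8</Prot-ref_name_E>"
    "</Prot-ref_name>"
)

# Direct children of `parent`, in document order.
names = [e.text for e in parent.findall("Prot-ref_name_E")]
print(names)

# Wrap the fragment in a record element to show the difference in depth.
record = ET.Element("Entrezgene")
record.append(parent)

shallow = record.findall("Prot-ref_name_E")                      # no direct children match
deep = [e.text for e in record.findall(".//Prot-ref_name_E")]   # all descendants
print(len(shallow), deep)
```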

(you have read the elementtree documentation, I hope?)

</F>

Jul 19 '05 #14
As you can read in my other post, my problem was with iterating through the
list. I didn't know that you have to use e.text; I only printed e, not e.text.
I did read the documentation, but I must admit not all of it.

Anyway, thank you very much!

On Fri, 22 Apr 2005 15:47:08 +0200, Fredrik Lundh wrote:
<snip>

Jul 19 '05 #15
Willem Ligtenberg wrote:
> By the way, I know about findall, but when I iterate through it like:
> for x in function:
>     print('function %s' % x)
>
> I get:
> function <Element 'Prot-ref_name_E' at 0xb7d10cf8>
> function <Element 'Prot-ref_name_E' at 0xb7d10d10>
>
> But of course I want the information in there...


for x in function:
    print('function %s' % x.text)

</F>

Jul 19 '05 #16
Willem Ligtenberg <WL*********@gmail.com> wrote:
> ....
> ID --> <Gene-track_geneid>320632</Gene-track_geneid>
> ....
> Product-type --> <Entrezgene_type value="protein-coding">6</Entrezgene_type>
> ....
> EC --> <Prot-ref_ec>
> <Prot-ref_ec_E>1.5.1.5</Prot-ref_ec_E>
> <Prot-ref_ec_E>3.5.4.9</Prot-ref_ec_E>
> </Prot-ref_ec>
> ....
> Some can happen more than once in a record.


Since all your data are contained in unique tags on individual lines,
you can tackle this in many different ways. Okay, that's your input
format. What is your output format?

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.
Jul 19 '05 #17
Is there an easy way to couple data together? I have discovered an
irritating feature in the XML file.
Sometimes a database reference looks like this:
<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_str>1234</Object-id_str>
</Object-id>
</Dbtag_tag>
</Dbtag>

And sometimes:

<Dbtag>
<Dbtag_db>UCSC</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>1234</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>

So I get one list of database names and two (!) lists of IDs,
and those are in no way related to each other. Is there an easy way to create a
dictionary like this: DBname --> ID?
If not, I still might need to revert to SAX... :(

On Fri, 22 Apr 2005 15:56:29 +0200, Willem Ligtenberg wrote:
<snip>

Jul 19 '05 #18
Willem Ligtenberg wrote:
> So I get a list database names and two! lists of ID's
> And those two are in no way related. Is there an easy way to create a
> dictionary like this DBname --> ID

why not just check for both alternatives?

text = elem.findtext("Object-id_str")
if text is None:
    text = elem.findtext("Object-id_id")

(or you can loop over the child elements and map elem.tag through a
dictionary...)
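
A minimal sketch of the "map elem.tag through a dictionary" idea, assuming Python 3 and the standard library's xml.etree.ElementTree. The CONVERTERS table and the function name are hypothetical, and treating Object-id_id as an integer is an assumption made for this example only.

```python
import xml.etree.ElementTree as ET

# Tag of the child that carries the value -> converter applied to its text.
# Converting Object-id_id to int is an assumption, not something the
# thread specifies.
CONVERTERS = {
    "Object-id_id": int,
    "Object-id_str": str,
}


def object_id_value(object_id):
    """Return the value held by an <Object-id> element, or None if absent."""
    for child in object_id:
        convert = CONVERTERS.get(child.tag)
        if convert is not None and child.text is not None:
            return convert(child.text)
    return None


as_number = ET.fromstring("<Object-id><Object-id_id>1234</Object-id_id></Object-id>")
as_string = ET.fromstring("<Object-id><Object-id_str>1234</Object-id_str></Object-id>")
print(object_id_value(as_number), object_id_value(as_string))
```

Adding a third variant later would then only need one more entry in the table.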

> If not, I still might need to revert to SAX... :(


you still have to check for both alternatives...

(if you find a parsing problem that you cannot solve with a light-weight
DOM, SAX won't help you...)

</F>

Jul 19 '05 #19
Since more than one database reference is possible per record, you
get, per record, a list of database names, a list of database strings and
a list of database IDs (where the strings and the IDs are really the same thing...).
So per record you check for both alternatives, but since there could be
more than one, you do findall and get an (unsorted) list back. And now you
don't know which ID belonged to which database...
See my problem?

Cheers,

Willem

On Fri, 22 Apr 2005 19:38:03 +0200, Fredrik Lundh wrote:
<snip>

Jul 19 '05 #20
Willem Ligtenberg wrote:
> Since there are more than one database references possible per record you
> should get per record a list of database names, database strings and
> databases ids. (where the strings and the id's are really the same thing...)
> So per record you check for both alternatives but since there could be
> more than one, you do findall and get a (unsorted) list back.

findall returns matching elements in document order.

> And now you don't know which ID belonged to which database...

why not? by looking at each database separately, surely you must be
able to figure out if the subelement holds an ID or a string? sure, if you
do document.findall(".//Object-id_id"), you'll get all IDs in document
order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
elements, and can then look inside them to see what they contain.

> See my problem?


I'm afraid not. the document seems to have a clear structure; for some
reason, you don't seem to take that into account in your program.
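
To make the per-Dbtag approach concrete, here is a minimal sketch assuming Python 3 and the standard library's xml.etree.ElementTree. The function name db_links is hypothetical. It returns pairs rather than a dictionary on the assumption that the same database name may occur more than once in a record.

```python
import xml.etree.ElementTree as ET


def db_links(record):
    """Return (database name, id) pairs for every <Dbtag> in one record."""
    links = []
    for dbtag in record.findall(".//Dbtag"):
        name = dbtag.findtext("Dbtag_db")
        # The id sits in either Object-id_str or Object-id_id; try both.
        ident = dbtag.findtext("Dbtag_tag/Object-id/Object-id_str")
        if ident is None:
            ident = dbtag.findtext("Dbtag_tag/Object-id/Object-id_id")
        links.append((name, ident))
    return links


# Small record built from the Dbtag fragments posted in this thread.
record = ET.fromstring(
    "<Entrezgene>"
    "<Dbtag><Dbtag_db>UCSC</Dbtag_db><Dbtag_tag><Object-id>"
    "<Object-id_str>1234</Object-id_str></Object-id></Dbtag_tag></Dbtag>"
    "<Dbtag><Dbtag_db>GO</Dbtag_db><Dbtag_tag><Object-id>"
    "<Object-id_id>5524</Object-id_id></Object-id></Dbtag_tag></Dbtag>"
    "</Entrezgene>"
)
print(db_links(record))
```

Because the name and the id are read from the same Dbtag element, they cannot get out of step; dict(db_links(record)) gives the DBname --> ID mapping when names are known to be unique.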

</F>

Jul 19 '05 #21
> order. but if you do record.findall(".//Dbtag"), you get a list of all Dbtag
> elements


make that "you get a list of all Dbtag elements in that record"

</F>
Jul 19 '05 #22
Willem Ligtenberg wrote:
> Is there an easy way, to couple data together. Because I have discovered an
> irritating feature in the xml file.
> Sometimes this is a database reference:
> <Dbtag>
> <Dbtag_db>UCSC</Dbtag_db>
> <Dbtag_tag>
> <Object-id>
> <Object-id_str>1234</Object-id_str>
> </Object-id>
> </Dbtag_tag>
> </Dbtag>
>
> And sometimes:
>
> <Dbtag>
> <Dbtag_db>UCSC</Dbtag_db>
> <Dbtag_tag>
> <Object-id>
> <Object-id_id>1234</Object-id_id>
> </Object-id>
> </Dbtag_tag>
> </Dbtag>
>
> So I get a list database names and two! lists of ID's
> And those two are in no way related. Is there an easy way to create a
> dictionary like this DBname --> ID
> If not, I still might need to revert to SAX... :(


None of your requirements sound particularly difficult to implement. If you would post a complete
example of the data you want to parse and the data you would like to end up with, it would be easier to
help you. The sample data you posted originally does not have many of the fields you want to extract,
and your example of what you want to end up with is not too clear either.

If you are having trouble with ElementTree, I expect you will be completely lost with SAX;
ElementTree is much easier to work with, and cElementTree is very fast.

Kent
Jul 19 '05 #23
