On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics
of course :)) But I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.
So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?
Thanks in advance,
Willem Ligtenberg
A total newbie to Python, by the way.
You may want to try Expat (www.libexpat.org) or the Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)
Care to post more details?
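To make the "feed it a small piece at a time" idea concrete, here is a minimal sketch using Python's standard xml.parsers.expat module. The function name iter_gene_ids, the chunk size, and the choice of the Gene-track_geneid element are assumptions made for illustration only; the stream is assumed to be opened in binary mode.

```python
import io
import xml.parsers.expat


def iter_gene_ids(stream, chunk_size=64 * 1024):
    """Yield the text of each <Gene-track_geneid>, reading the stream in chunks.

    Hypothetical helper: the element name is only an illustration.
    """
    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # merge adjacent text callbacks where possible
    found = []                 # IDs completed while parsing the current chunk
    state = {"inside": False, "text": []}

    def start(name, attrs):
        if name == "Gene-track_geneid":
            state["inside"] = True
            state["text"] = []

    def chars(data):
        # Text can still arrive in several pieces, so collect it.
        if state["inside"]:
            state["text"].append(data)

    def end(name):
        if name == "Gene-track_geneid":
            found.append("".join(state["text"]).strip())
            state["inside"] = False

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # False: more data will follow
        while found:
            yield found.pop(0)
    parser.Parse(b"", True)         # True: signal the end of the document
    while found:
        yield found.pop(0)
```

Something like `for gene_id in iter_gene_ids(open("genes.xml", "rb")):` would then walk the file without loading it whole; the file name here is made up.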
The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<Entrezgene_track-info>
<Gene-track>
<Gene-track_geneid>9996</Gene-track_geneid>
<Gene-track_status value="secondary">1</Gene-track_status>
<Gene-track_current-id>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
<Dbtag>
<Dbtag_db>GeneID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>320632</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Gene-track_current-id>
<Gene-track_create-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2003</Date-std_year>
<Date-std_month>8</Date-std_month>
<Date-std_day>28</Date-std_day>
<Date-std_hour>21</Date-std_hour>
<Date-std_minute>39</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_create-date>
<Gene-track_update-date>
<Date>
<Date_std>
<Date-std>
<Date-std_year>2005</Date-std_year>
<Date-std_month>2</Date-std_month>
<Date-std_day>17</Date-std_day>
<Date-std_hour>12</Date-std_hour>
<Date-std_minute>54</Date-std_minute>
<Date-std_second>0</Date-std_second>
</Date-std>
</Date_std>
</Date>
</Gene-track_update-date>
</Gene-track>
</Entrezgene_track-info>
<Entrezgene_type value="protein-coding">6</Entrezgene_type>
<Entrezgene_source>
<BioSource>
<BioSource_genome value="genomic">1</BioSource_genome>
<BioSource_origin value="natural">1</BioSource_origin>
<BioSource_org>
<Org-ref>
<Org-ref_taxname>Mus musculus</Org-ref_taxname>
<Org-ref_common>house mouse</Org-ref_common>
<Org-ref_db>
<Dbtag>
<Dbtag_db>taxon</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>10090</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Org-ref_db>
<Org-ref_syn>
<Org-ref_syn_E>mouse</Org-ref_syn_E>
</Org-ref_syn>
<Org-ref_orgname>
<OrgName>
<OrgName_name>
<OrgName_name_binomial>
<BinomialOrgName>
<BinomialOrgName_genus>Mus</BinomialOrgName_genus>
<BinomialOrgName_species>musculus</BinomialOrgName_species>
</BinomialOrgName>
</OrgName_name_binomial>
</OrgName_name>
<OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
<OrgName_gcode>1</OrgName_gcode>
<OrgName_mgcode>2</OrgName_mgcode>
<OrgName_div>ROD</OrgName_div>
</OrgName>
</Org-ref_orgname>
</Org-ref>
</BioSource_org>
</BioSource>
</Entrezgene_source>
<Entrezgene_gene>
<Gene-ref>
</Gene-ref>
</Entrezgene_gene>
<Entrezgene_gene-source>
<Gene-source>
<Gene-source_src>LocusLink</Gene-source_src>
<Gene-source_src-int>9996</Gene-source_src-int>
<Gene-source_src-str2>9996</Gene-source_src-str2>
<Gene-source_gene-display value="false"/>
<Gene-source_locus-display value="false"/>
<Gene-source_extra-terms value="false"/>
</Gene-source>
</Entrezgene_gene-source>
<Entrezgene_locus>
<Gene-commentary>
<Gene-commentary_type value="genomic">1</Gene-commentary_type>
<Gene-commentary_version>0</Gene-commentary_version>
</Gene-commentary>
</Entrezgene_locus>
<Entrezgene_unique-keys>
<Dbtag>
<Dbtag_db>LocusID</Dbtag_db>
<Dbtag_tag>
<Object-id>
<Object-id_id>9996</Object-id_id>
</Object-id>
</Dbtag_tag>
</Dbtag>
</Entrezgene_unique-keys>
<Entrezgene_xtra-index-terms>
<Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
</Entrezgene_xtra-index-terms>
</Entrezgene>
</Entrezgene-Set>
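
For a record shaped like the sample above, one possible way to read "one record at a time" and keep only a few fields is iterparse from the standard xml.etree.ElementTree module. This is a different technique from feeding Expat by hand, and only a sketch: the Gene class, its three fields, and the function name iter_genes are assumptions chosen for illustration, not anything the thread prescribes.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gene:
    # Hypothetical container; which fields to keep is an assumption.
    gene_id: Optional[str]
    taxname: Optional[str]
    gene_type: Optional[str]


def iter_genes(source):
    """Yield one Gene per <Entrezgene> record without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Entrezgene":
            type_elem = elem.find("Entrezgene_type")
            yield Gene(
                gene_id=elem.findtext(".//Gene-track_geneid"),
                taxname=elem.findtext(".//Org-ref_taxname"),
                gene_type=type_elem.get("value") if type_elem is not None else None,
            )
            # Drop the finished record so memory use stays roughly constant.
            root.clear()
```

Each Gene could then be written to the output file as it is produced, so only one record is held in memory at any time.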