Bytes | Software Development & Data Engineering Community

XML parsing per record

I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.
Jul 19 '05 #1
Willem Ligtenberg wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?

Thanks in advance,

Willem Ligtenberg
A total newbie to python by the way.

Read about SAX parsers.
This may be of help:
http://www.devarticles.com/c/a/XML/P...AX-and-Python/
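
As a rough illustration of the SAX approach (a sketch, not taken from the linked article), the code below uses the standard library's xml.sax module to stream through a document and collect the text of one element per record. The element name Gene-track_geneid is borrowed from the sample posted later in this thread; the handler class name, the inline sample data, and the second gene id are made up for the example.

```python
import io
import xml.sax


class GeneIdHandler(xml.sax.ContentHandler):
    """Collects the text of every <Gene-track_geneid> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.gene_ids = []
        self._buffer = None  # list of text chunks while inside the target element

    def startElement(self, name, attrs):
        if name == "Gene-track_geneid":
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several pieces, so accumulate them.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Gene-track_geneid":
            self.gene_ids.append("".join(self._buffer))
            self._buffer = None


# Invented miniature document; a real run would pass a file name or open file.
SAMPLE = b"""<Entrezgene-Set>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>9996</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
  <Entrezgene><Entrezgene_track-info><Gene-track>
    <Gene-track_geneid>10000</Gene-track_geneid>
  </Gene-track></Entrezgene_track-info></Entrezgene>
</Entrezgene-Set>"""

handler = GeneIdHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.gene_ids)
```

The parser calls the handler as it reads, so only the text currently being collected is held in memory, whatever the size of the file.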

Out of curiosity, why is the data stored in an XML file?
XML is not known for its efficiency....

--Irmen
Jul 19 '05 #2
Irmen de Jong wrote:
XML is not known for its efficiency....


<sarcasm> Surely you are blaspheming, sir! XML's the greatest thing
since peanut butter! </sarcasm>

I'm just *waiting* for the day someone finds its use on the rolls of
toilet paper... oh the glorious day...

Jul 19 '05 #3
Willem Ligtenberg wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics, of course :)),
but I have no clue how to do that. Most things I see read the entire XML
file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be able
to store the information in another object.
How should I do that?


You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_...ment-generator

Kent
Jul 19 '05 #4
Kent Johnson wrote:
So I would like to parse an XML file one record at a time and then be able
to store the information in another object.


You might be interested in this recipe using ElementTree:
http://online.effbot.org/2004_12_01_...ment-generator


if you have ElementTree 1.2.5 or later, the "iterparse" function provides a
more efficient implementation of that pattern:

http://effbot.org/zone/element-iterparse.htm

the cElementTree implementation of "iterparse" is a lot faster than SAX; see
the second table under

http://effbot.org/zone/celementtree.htm#benchmarks

for some figures.
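
For readers who want to see the shape of the iterparse pattern, here is a minimal sketch using xml.etree.ElementTree from the Python 3 standard library (an assumption on my part; the post refers to the 2005 ElementTree and cElementTree packages). The helper name iter_records, the "entry" tag, and the sample data are invented for illustration. It also clears the root element after each record, so that finished records do not pile up under the root.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_tag):
    """Yield each complete <record_tag> element, then release it (hypothetical helper)."""
    context = iter(ET.iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            yield elem
            root.clear()  # drop finished records so the tree does not keep growing


# Invented sample; a real run would pass a file name or an open binary file.
SAMPLE = b"<log><entry id='1'>a</entry><entry id='2'>b</entry><entry id='3'>c</entry></log>"

seen = [(e.get("id"), e.text) for e in iter_records(io.BytesIO(SAMPLE), "entry")]
print(seen)
```

Each element should be fully processed before asking the generator for the next one, since the record is detached from the tree as soon as iteration resumes.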

</F>

Jul 19 '05 #5
Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.


You may want to try Expat (www.libexpat.org) or a Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)
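
To make the "feed a small piece at a time" idea concrete, here is a small sketch using xml.parsers.expat, the Expat wrapper in the Python standard library. The function name count_elements, the "gene" tag, and the sample data are assumptions made for the example; the chunk size is deliberately tiny in the demonstration to show that chunks may split tags anywhere.

```python
import io
import xml.parsers.expat


def count_elements(stream, tag, chunk_size=64 * 1024):
    """Feed the stream to Expat chunk by chunk and count closing tags named `tag`."""
    count = 0

    def end_element(name):
        nonlocal count
        if name == tag:
            count += 1

    parser = xml.parsers.expat.ParserCreate()
    parser.EndElementHandler = end_element
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.Parse(chunk, False)  # False: more data will follow
    parser.Parse(b"", True)  # True: signal the end of the document
    return count


# Invented sample; a real run would open the large file in binary mode instead.
SAMPLE = b"<genes><gene>a</gene><gene>b</gene></genes>"
print(count_elements(io.BytesIO(SAMPLE), "gene", chunk_size=7))
```

Expat keeps its own state between calls, so the caller never needs to align chunks with record boundaries.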

Care to post more details?

--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.

Jul 19 '05 #6
William Park wrote:
You may want to try Expat (www.libexpat.org) or a Python wrapper for it.


Python comes with a low-level expat wrapper (pyexpat).

however, if you want performance, cElementTree (which also uses expat) is a
lot faster than pyexpat. (see my other post for links to benchmarks and code).

</F>

Jul 19 '05 #7
On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?

Thanks in advance,

Willem Ligtenberg A total newbie to python by the way.


You may want to try Expat (www.libexpat.org) or a Python wrapper for it.
You can feed it a small piece at a time, say by lines or whatever. Of
course, it all depends on what kind of parsing you have in mind. :-)

Care to post more details?


The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info>
      <Gene-track>
        <Gene-track_geneid>9996</Gene-track_geneid>
        <Gene-track_status value="secondary">1</Gene-track_status>
        <Gene-track_current-id>
          <Dbtag>
            <Dbtag_db>LocusID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
          <Dbtag>
            <Dbtag_db>GeneID</Dbtag_db>
            <Dbtag_tag>
              <Object-id>
                <Object-id_id>320632</Object-id_id>
              </Object-id>
            </Dbtag_tag>
          </Dbtag>
        </Gene-track_current-id>
        <Gene-track_create-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2003</Date-std_year>
                <Date-std_month>8</Date-std_month>
                <Date-std_day>28</Date-std_day>
                <Date-std_hour>21</Date-std_hour>
                <Date-std_minute>39</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_create-date>
        <Gene-track_update-date>
          <Date>
            <Date_std>
              <Date-std>
                <Date-std_year>2005</Date-std_year>
                <Date-std_month>2</Date-std_month>
                <Date-std_day>17</Date-std_day>
                <Date-std_hour>12</Date-std_hour>
                <Date-std_minute>54</Date-std_minute>
                <Date-std_second>0</Date-std_second>
              </Date-std>
            </Date_std>
          </Date>
        </Gene-track_update-date>
      </Gene-track>
    </Entrezgene_track-info>
    <Entrezgene_type value="protein-coding">6</Entrezgene_type>
    <Entrezgene_source>
      <BioSource>
        <BioSource_genome value="genomic">1</BioSource_genome>
        <BioSource_origin value="natural">1</BioSource_origin>
        <BioSource_org>
          <Org-ref>
            <Org-ref_taxname>Mus musculus</Org-ref_taxname>
            <Org-ref_common>house mouse</Org-ref_common>
            <Org-ref_db>
              <Dbtag>
                <Dbtag_db>taxon</Dbtag_db>
                <Dbtag_tag>
                  <Object-id>
                    <Object-id_id>10090</Object-id_id>
                  </Object-id>
                </Dbtag_tag>
              </Dbtag>
            </Org-ref_db>
            <Org-ref_syn>
              <Org-ref_syn_E>mouse</Org-ref_syn_E>
            </Org-ref_syn>
            <Org-ref_orgname>
              <OrgName>
                <OrgName_name>
                  <OrgName_name_binomial>
                    <BinomialOrgName>
                      <BinomialOrgName_genus>Mus</BinomialOrgName_genus>
                      <BinomialOrgName_species>musculus</BinomialOrgName_species>
                    </BinomialOrgName>
                  </OrgName_name_binomial>
                </OrgName_name>
                <OrgName_lineage>Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae; Murinae; Mus</OrgName_lineage>
                <OrgName_gcode>1</OrgName_gcode>
                <OrgName_mgcode>2</OrgName_mgcode>
                <OrgName_div>ROD</OrgName_div>
              </OrgName>
            </Org-ref_orgname>
          </Org-ref>
        </BioSource_org>
      </BioSource>
    </Entrezgene_source>
    <Entrezgene_gene>
      <Gene-ref>
      </Gene-ref>
    </Entrezgene_gene>
    <Entrezgene_gene-source>
      <Gene-source>
        <Gene-source_src>LocusLink</Gene-source_src>
        <Gene-source_src-int>9996</Gene-source_src-int>
        <Gene-source_src-str2>9996</Gene-source_src-str2>
        <Gene-source_gene-display value="false"/>
        <Gene-source_locus-display value="false"/>
        <Gene-source_extra-terms value="false"/>
      </Gene-source>
    </Entrezgene_gene-source>
    <Entrezgene_locus>
      <Gene-commentary>
        <Gene-commentary_type value="genomic">1</Gene-commentary_type>
        <Gene-commentary_version>0</Gene-commentary_version>
      </Gene-commentary>
    </Entrezgene_locus>
    <Entrezgene_unique-keys>
      <Dbtag>
        <Dbtag_db>LocusID</Dbtag_db>
        <Dbtag_tag>
          <Object-id>
            <Object-id_id>9996</Object-id_id>
          </Object-id>
        </Dbtag_tag>
      </Dbtag>
    </Entrezgene_unique-keys>
    <Entrezgene_xtra-index-terms>
      <Entrezgene_xtra-index-terms_E>LOC320632</Entrezgene_xtra-index-terms_E>
    </Entrezgene_xtra-index-terms>
  </Entrezgene>
</Entrezgene-Set>

Jul 19 '05 #8
Willem Ligtenberg wrote:
Willem Ligtenberg <WL*********@gmail.com> wrote:
I want to parse a very large (2.4 gig) XML file (bioinformatics,
of course :)), but I have no clue how to do that. Most things I see read
the entire XML file at once. That isn't going to work here, of course.

So I would like to parse an XML file one record at a time and then be
able to store the information in another object. How should I do
that?


The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.
This is an example of the XML:
<?xml version="1.0"?>
<!DOCTYPE Entrezgene-Set PUBLIC "-//NCBI//NCBI Entrezgene/EN" "NCBI_Entrezgene.dtd">
<Entrezgene-Set>
<Entrezgene>
<snip>
</Entrezgene>
</Entrezgene-Set>


This should get you started with cElementTree:

import cElementTree as ElementTree

source = 'Entrezgene.xml'

for event, elem in ElementTree.iterparse(source):
    if elem.tag == 'Entrezgene':
        # Process the Entrezgene element
        geneid = elem.findtext('Entrezgene_track-info/Gene-track/Gene-track_geneid')
        print 'Gene id', geneid

        # Throw away the element, we're done with it
        elem.clear()
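
For anyone reading this on Python 3, the same pattern can be sketched with xml.etree.ElementTree from the standard library, which provides iterparse. The version below is only an illustration under some assumptions: the Gene class, its two fields, the iter_genes helper, and the tab-separated output format are invented here, and the inline sample is a cut-down copy of the record posted above. It also shows one way to store each record in a separate object and print it to a file-like target, as the original question asks.

```python
import io
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Gene:
    """Hypothetical container for the fields of interest."""
    gene_id: str
    taxname: str


# Paths follow the element nesting of the sample record in this thread.
GENE_ID_PATH = "Entrezgene_track-info/Gene-track/Gene-track_geneid"
TAXNAME_PATH = "Entrezgene_source/BioSource/BioSource_org/Org-ref/Org-ref_taxname"


def iter_genes(source):
    """Yield one Gene per <Entrezgene> record without keeping its subtree."""
    for event, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == "Entrezgene":
            yield Gene(elem.findtext(GENE_ID_PATH), elem.findtext(TAXNAME_PATH))
            elem.clear()  # throw away the record's children, we're done with it


# Cut-down sample; a real run would pass the path of the large file instead.
SAMPLE = b"""<Entrezgene-Set>
  <Entrezgene>
    <Entrezgene_track-info><Gene-track>
      <Gene-track_geneid>9996</Gene-track_geneid>
    </Gene-track></Entrezgene_track-info>
    <Entrezgene_source><BioSource><BioSource_org><Org-ref>
      <Org-ref_taxname>Mus musculus</Org-ref_taxname>
    </Org-ref></BioSource_org></BioSource></Entrezgene_source>
  </Entrezgene>
</Entrezgene-Set>"""

out = io.StringIO()  # stands in for an output file opened for writing
for gene in iter_genes(io.BytesIO(SAMPLE)):
    out.write(f"{gene.gene_id}\t{gene.taxname}\n")
print(out.getvalue(), end="")
```

Note that elem.clear() empties each record but leaves the emptied element attached to the root, so a very long file still accumulates one small empty element per record.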

Kent
Jul 19 '05 #9
Willem Ligtenberg <WL*********@gmail.com> wrote:
On Sun, 17 Apr 2005 02:16:04 +0000, William Park wrote:
Care to post more details?
The XML file I need to parse contains information about genes.
So the first element is a gene, and then there are a lot of sub-elements with
sub-elements. I only need some of the information and want to store it in
an object called gene. Later on this information will be printed into a
file, which in its turn will be fed into some other program.


You have to help us a little more here. Which info do you want to
extract from the example below?
<Entrezgene-Set>
...
</Entrezgene-Set>


--
William Park <op**********@yahoo.ca>, Toronto, Canada
Slackware Linux -- because it works.
Jul 19 '05 #10
