473,503 Members | 1,635 Online
Bytes | Software Development & Data Engineering Community
+ Post

Home Posts Topics Members FAQ

Parsing XML and storing attributes in MySQL

Hello,

I am trying to store huge amounts of data from xml files and put them
into a MySQL database. The xml files all are in this format:

<?xml version="1.0" encoding="UTF-8"?>
<snp_submission xsi:schemaLocation="http://www.hapmap.org
http://hapmap.cshl.org/xml-schema/2003-1
1-04/hapmap.xsd"
lsid="urn:LSID:ncbi.nlm.nih.gov:dbSNP/DCC_batch:34_chrom11_DELETE:005"
xmlns="http
://www.hapmap.org"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<batch_info>
<class>DELETE</class>
<group lsid="urn:lsid:dcc.hapmap.cshl.org:Labgroup:DBSNP: 1"/>
<date_created>2005-09-15</date_created>
<contact>
<name>Steve Sherry</name>
<email>sh****@ncbi.nlm.nih.gov</email>
</contact>
</batch_info>
<snps>
<snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1544:111">
<snp_class>
<bac-overlap/>
</snp_class>
<sequence>

<flank_5>caaaggaatataaatcattctattataaagatacatgcacA GGgctgggtgcagtggctcacacctgtaatccc
agcactttgggaggccaaggcgggtggatcacctgaggacaggagtttga gaccagcctagccaacatggggaaactccatctctactaaaaatacaaa
aattagccaggtatagtggtgcacacctgtaataccagctactttggagg ctgaggcaggagaatcgctggaacccaggaggcagaggtcaaagtgagc
caagatcataccattgcactccagcctgggcaacaagagcaaaactccat cttaaaaaaatatatatatatacatatacatacatatatatacacatat
atatacatatatacagatattatatatgtaaatgtatatatatgtgtata tatatacacatatatatacatattataactacatatatatacacacaca
catacatatacatgcacacatatgtttattgcagcactatttacgataga aaatacatggaatcctcccaaatgcccatcaatgatatattggataaag
aaaatgtgatatatattcaccatggaatactatgcagccgttaaaataaa tgagatcatgttctttgcagggacatggatgaagctggaagccatcacc
ctcagcaaactaacacaggaaaagaaaaccaaacaccacatgttctcagt cgtaagagggagttgaacaatgagagcaaacacatggatacatggaggg
gaacaacacacaccagggcctctcagcgggacaggggtaggagaCCATCA GGACAaacacgtggatacatggaggggaacaacacacaccagggcctct
cagggggacagggggtaggagaccatcaagacaaacacgtggatacatgg aggggaacaacacacaccagggcctctcagggggacagggggtaggaga
ccatcaggacaaacacgtggatacatggaggggaa</flank_5>
<variation>
<allele base="C"/>
<allele base="T"/>
</variation>

<flank_3>aacacacaccagggcctctcagggggacagggggtaggaga ccatcaRgacaaacacgtggRtacatggagggg
aacaacacacaccagggcctctcagggggacggggggtagRagaccatca ggacaaatagctaatgcatgcagggcctcatacctaggtgatgggttga
tgggtgcagcaaaccaccatggcacacatttacctatgtatcaaacctaY actttctgcacgtgtatcccagaacataaaataaaatttaaaaaatata
taCACTGATTCATGATCTCCTttctctccttctgaaacactctttaaaac tttttagcatttccccctctgtcttccatgtctcctaactacatgtttc
ttattttccatgtctttattcctgtgttcattttggatagccccttctga cctatattacagtttactagttcactcttcaactgcttctaacatacta
atattctgttaaaaccattcatttgggtttaaatttcaattatgttattc tctatggacattctatttgttttcttttaatcttcttggccattctcta
gagtttcctgttccattatgatatttttaattttttgttttactttaaac atactaaatatagttattttattttattttctgtatctgatactttcaa
taactgcagtctttgctagtcttttttctgtgctcttgctcatagttttt ttcatttgttttCATGATTagaaaaacagagagagaagaaggagagtaa
agggaggaggcggaggaggagaaaagaagaaagcagagaagaagggacag agaaaaaaaggaagTTGGTTCTAACGTTTCTCTAACAACTGGCTTCAGT
GAAACACTCCCACCTTGTGGATTTTTAGGTTATTGAAATTAACCAGTCTT Ctgggtgcagcacaccaacatggcacatgtatacatatgtaacaaacct
gcactttgtgcacatgtaccctaaaacttaaagta</flank_3>
<length_class value="full-length"/>
</sequence>
<genomic_locations>
<genomic_location active="true" type="exact">
<assembly_version>ncbi_build34</assembly_version>
<coordinates chrom="11" start="60749" stop="60749"
strand="forward"/>
</genomic_location>
</genomic_locations>
<neighbour_snps>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1545:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1546:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1547:101"/>
<neighbour_snp
lsid="urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs268 6858:100"/>
</neighbour_snps>
</snp>
I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::Parser since the information I am looking for is
embedded in the element's attributes.

If anyone has any ideas on how to get this data out of xml and into
MySQL, it would be much appreciated.

Thanks so much,

--James

Jun 12 '06 #1
3 1898
jkugler wrote:
I am trying to store huge amounts of data from xml files and put them
into a MySQL database. The xml files all are in this format:
This is genomic data, so you probably have
GigaBytes of this stuff. Very few tools are
capable of handling such amounts in acceptable
time with acceptable resource usage.
I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::Parser since the information I am looking for is
embedded in the element's attributes.
Have a look at what Andrew Schorr did to store
GigaByte of XML data into PostgreSQL:

http://home.vrweb.de/~juergen.kahrs/...nto-PostgreSQL

There is currently no interface for connecting
to MySQL. But Andrew explained how such an interface
should be implemented, when he answered a similar
question today in comp.lang.awk:
Please take a look at the PostgreSQL API that I implemented
for xgawk: http://sourceforge.net/projects/xmlgawk. I think
a Mysql extension next would be nice.

I had thought of building a general xgawk database API
over the libdbi library (http://sourceforge.net/projects/libdbi),
but I found that the libdbi API did not offer all of the PostgreSQL
API features that I needed to access. But as libdbi matures,
that may be a good layer to build upon.

But xgawk would certainly be a great platform for you to
use in building a new database access mechanism. Contributions
are welcome.

Jun 12 '06 #2
If you're dealing with gigabytes of data, you may want to look at IBM's
new XML capabilities in DB2... or at "exploding" the XML into a standard
database representation and operating on that. (XML is a good
interchange format, but often is not the best back-end representation.)

--
() ASCII Ribbon Campaign | Joe Kesselman
/\ Stamp out HTML e-mail! | System architexture and kinetic poetry
Jun 12 '06 #3
jkugler wrote:
I am trying to extract the following information: lsid, allele base,
cooordinates chrom, start, and strand. I would ideally like to use a
I just had a look at the data. This is a structural
outline of your XML snippet:

snps
snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1544:111'
snp_class
bac-overlap
sequence
flank_5
variation
allele base='C'
allele base='T'
flank_3
length_class value='full-length'
genomic_locations
genomic_location active='true' type='exact'
assembly_version
coordinates chrom='11' start='60749' stop='60749' strand='forward'
neighbour_snps
neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1545:101'
neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1546:101'
neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs246 1547:101'
neighbour_snp lsid='urn:lsid:ncbi.nlm.nih.gov:dbSNP;refSNP:rs268 6858:100'

Why don't you convert it to CSV format and import
the CSV data into your data base ?
perl script to pull this data out and put it in MySQL, but I have had
no luck with XML::Parser since the information I am looking for is
embedded in the element's attributes.


You probably mean lines likes this one:

coordinates chrom='11' start='60749' stop='60749' strand='forward'

This is particularly easy to handle in XMLgawk.
I guess it would take about 10 to 20 lines of
XMLgawk to convert your data to CSV format.
Jun 12 '06 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

0
1397
by: Federico Bari | last post by:
Hi all, I'm a bigginer using XML with Perl or PHP; I have to manage xml files storing datas of more than one table (inside the same xml file), and use it to update a mySQL database (the xml...
16
2852
by: Terry | last post by:
Hi, This is a newbie's question. I want to preload 4 images and only when all 4 images has been loaded into browser's cache, I want to start a slideshow() function. If images are not completed...
2
2103
by: fizzy | last post by:
i am fetching an xml document with the following structure: <?xml version="1.0" encoding="UTF-8"?> <DTCResponse xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"...
9
4042
by: ankitdesai | last post by:
I would like to parse a couple of tables within an individual player's SHTML page. For example, I would like to get the "Actual Pitching Statistics" and the "Translated Pitching Statistics"...
1
3662
by: Martin Pöpping | last post by:
Hello, I´ve a problem with parsing a double value from an xml file. My code looks like this: int concept_id; double rank; XmlElement root = documentXMLString.DocumentElement; XmlNodeList...
4
2052
by: Falcolas | last post by:
I personally feel that it's a bad idea. I've got to convince managerial types of this. Would some of you mind posting comments for and against the concept of storing data in XML attributes? If...
5
7704
by: PatlaDJ | last post by:
Java SAX parser, please need a clue how to get the raw XML code of the currently parsing event... needed for logging, debugging purposes. Here's and example, letting me clarify exactly what i...
1
2187
nine72
by: nine72 | last post by:
Ok, I am at a complete loss on this and have finally come to the XML Parsing Gods (and perhaps a PHP minor deity) for guidance… I will try my best to describe what I have going on… 1) I have 15...
9
6265
nine72
by: nine72 | last post by:
Ok, so I have figured out how to parse an custom returned XML doc (actually a string that I made into a doc for testing). At this point I am attempting to integrate my parse routine into my main...
0
7202
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
7086
by: Hystou | last post by:
Most computers default to English, but sometimes we require a different language, especially when relocating. Forgot to request a specific language before your computer shipped? No problem! You can...
0
7280
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
7332
jinu1996
by: jinu1996 | last post by:
In today's digital age, having a compelling online presence is paramount for businesses aiming to thrive in a competitive landscape. At the heart of this digital strategy lies an intricately woven...
0
5578
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
3154
by: adsilva | last post by:
A Windows Forms form does not have the event Unload, like VB6. What one acts like?
0
1512
by: 6302768590 | last post by:
Hai team i want code for transfer the data from one system to another through IP address by using C# our system has to for every 5mins then we have to update the data what the data is updated ...
1
736
muto222
by: muto222 | last post by:
How can i add a mobile payment intergratation into php mysql website.
0
382
bsmnconsultancy
by: bsmnconsultancy | last post by:
In today's digital era, a well-designed website is crucial for businesses looking to succeed. Whether you're a small business owner or a large corporation in Toronto, having a strong online presence...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.