Parsing tag names and values from XML files

Sir,

Could you please help me write Python code to parse values from XML files? I would like to extract Person, Email, Phone, Organization, Address, etc. from the XML file. I have written some code for it, but could you please correct it? Please find the attached MINiML.txt file.

    #!/usr/bin/python
    print "Content-Type: text/plain\n"
    print "<html><body>"
    import xml.dom.minidom

    # Load the Contibutor collection
    MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )

    # Get a list of Contibutors
    Contibutors = MINiML.documentElement.getElementsByTagName( 'Contibutor' )

    # Loop through the Contibutors
    for Contibutor in Contibutors:

        #Print out the Contibutor's information
        print
        print 'Email:  ' + Contibutor.getElementsByTagName ( 'Email' )[0].childNodes [0].nodeValue
        print 'Phone: ' + Contibutor.getElementsByTagName ( 'Phone' )[0].childNodes [0].nodeValue
        print 'Laboratory:  ' + Contibutor.getElementsByTagName ( 'Laboratory' )[0].childNodes [0].nodeValue
        print 'Department:  ' + Contibutor.getElementsByTagName ( 'Department' ) [0].childNodes [0].nodeValue
        print 'Organization:  ' + Contibutor.getElementsByTagName ( 'Organization' )[0].childNodes [0].nodeValue
        print "</body></html>"

Regards,
Haobijam
Attached Files
File Type: txt MINiML.txt (276.7 KB, 752 views)
Nov 18 '10 #1
bvdet
2,851 Expert Mod 2GB
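To begin with, you misspelled "Contributor" as "Contibutor", so getElementsByTagName() returned an empty list and the loop body never ran.

Each "Contributor" can have a varying number of ELEMENT_NODE child nodes, and some of those child nodes can have ELEMENT_NODE child nodes of their own. Indexing with [0] will therefore fail for any Contributor that lacks the tag you ask for.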

Note that the first child node can be a Text node with no real text value:
    >>> Contributor.childNodes[0]
    <DOM Text node "\n    ">
    >>>

    >>> Contributor
    <DOM Element: Contributor at 0x1238ad0>
    >>> Contributor.childNodes[1].nodeName
    u'Organization'
    >>>


Try the following and look at the output. Then decide the best way to get the data you need for printing.
    for Contributor in Contributors:
        for elem in Contributor.childNodes:
            print repr(elem)
            # hasChildNodes is a method and must be called
            if elem.hasChildNodes():
                for item in elem.childNodes:
                    print "   ", repr(item)
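
For comparison, here is a minimal sketch of how the original script could look once the tag name is fixed and missing children are tolerated. It is written for Python 3 (the listings in this thread are Python 2), and the inline SAMPLE document, the first_text() and describe() helpers, and the "(none)" placeholder are all assumptions made for illustration, not part of the attached file:

```python
import xml.dom.minidom

# Hypothetical sample shaped like the Contributor records in this thread.
SAMPLE = """<MINiML>
  <Contributor>
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
    <Phone>646-962-5560</Phone>
  </Contributor>
  <Contributor>
    <Person><First>Brendan</First><Last>Carolan</Last></Person>
  </Contributor>
</MINiML>"""


def first_text(parent, tag, default=""):
    """Return the stripped text of the first <tag> below parent, or default."""
    matches = parent.getElementsByTagName(tag)
    if not matches:
        return default  # this Contributor has no such child
    # Keep only Text nodes; whitespace-only text collapses to "" after strip().
    parts = [n.data for n in matches[0].childNodes if n.nodeType == n.TEXT_NODE]
    return "".join(parts).strip() or default


def describe(contributor, tags=("Email", "Phone", "Laboratory")):
    """Return one 'Tag: value' line per requested tag."""
    return ["%s: %s" % (tag, first_text(contributor, tag, "(none)")) for tag in tags]


doc = xml.dom.minidom.parseString(SAMPLE)
contributors = doc.documentElement.getElementsByTagName("Contributor")
for contributor in contributors:
    print("\n".join(describe(contributor)))
    print()
```

With the real file you would call xml.dom.minidom.parse('MINiML.xml') instead of parseString(SAMPLE). Checking for an empty match list first avoids the IndexError that [0] raises for Contributors that only have a Person child.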
Nov 18 '10 #2
Dear,

Could you please tell me how I can parse the attributes and their values from the XML file (MINiML.txt)? I would like to print output like this:

Contributor iid = "contrib1"
Person Yael Strulovici-Barel
Email yas2003@med.cornell.edu
Phone 646-962-5560
Laboratory Crystal
Department Department of Genetic Medicine
Organization Weill Cornell Medical College
Line 1300 York Avenue
City New York
State NY
Zip-Code 10021
Country USA

Regards,
Haobijam
Nov 19 '10 #3
bvdet
I wrote some functions for an application of mine that you may find useful or give you ideas on how to format the output for your application. The first one returns a list of text found in the child nodes of a parent node. Whitespace is ignored.
    def getTextFromElem(parent):
        '''Return a list of text found in the child nodes of a
        parent node, discarding whitespace.'''
        textList = []
        for n in parent.childNodes:
            # TEXT_NODE - 3
            if n.nodeType == 3 and n.nodeValue.strip():
                textList.append(str(n.nodeValue.strip()))
        return textList

The second returns a list of element nodes below a parent node.
    def getElemChildren(parent):
        # Return a list of element nodes below parent
        elements = []
        for obj in parent.childNodes:
            if obj.nodeType == obj.ELEMENT_NODE:
                elements.append(obj)
        return elements

The third returns a list of strings representing the node tree below a parent node, using recursion to reach nested levels.
    def nodeTree(element, pad=0):
        # Return list of strings representing the node tree below element
        results = ["%s%s" % (pad*" ", str(element.nodeName))]
        nextElems = getElemChildren(element)
        if nextElems:
            for node in nextElems:
                results.extend(nodeTree(node, pad+2))
        else:
            results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
        return results

Using nodeTree() in your application:
    >>> contributors = xmlDoc.documentElement.getElementsByTagName( 'Contributor' )
    >>> for contributor in contributors:
    ...     print "\n".join(nodeTree(contributor))
    ...
    Contributor
      Person
        First
          Yael
        Last
          Strulovici-Barel
      Email
        yas2003@med.cornell.edu
      Phone
        646-962-5560
      Laboratory
        Crystal
      Department
        Department of Genetic Medicine
      Organization
        Weill Cornell Medical College
      Address
        Line
          1300 York Avenue
        City
          New York
        State
          NY
        Zip-Code
          10021
        Country
          USA
    Contributor
      Organization

      Email
        geo@ncbi.nlm.nih.gov, support@affymetrix.com
      Phone
        888-362-2447
      Organization
        Affymetrix, Inc.
      Address
        City
          Santa Clara
        State
          CA
        Zip-Code
          95051
        Country
          USA
      Web-Link
        http://www.affymetrix.com/index.affx
    Contributor
      Person
        First
          Brendan
        Last
          Carolan
    Contributor
      Person
        First
          Ben-Gary
        Last
          Harvey
    Contributor
      Person
        First
          Bishnu
        Middle
          P
        Last
          De
    Contributor
      Person
        First
          Holly
        Last
          Vanni
    Contributor
      Person
        First
          Ronald
        Middle
          G
        Last
          Crystal
    >>>

Since we are not extracting attributes, I modified the title of this thread.
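
If the iid attribute is wanted after all, minidom exposes attributes through each element's attributes map. The following is only a sketch for Python 3 that prints something close to the flat "Tag value" layout requested above. The inline SAMPLE and the helper names attribute_lines() and flat_lines() are hypothetical, and nested elements such as Person are flattened to their leaf tags (First, Last) rather than merged into one line:

```python
import xml.dom.minidom

# Hypothetical sample; the real data is in the attached MINiML file.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
    <Address><City>New York</City><State>NY</State></Address>
  </Contributor>
</MINiML>"""


def attribute_lines(element):
    """Return one 'Tag name = "value"' line per attribute of element."""
    attrs = element.attributes  # NamedNodeMap of Attr nodes
    lines = []
    for i in range(attrs.length):
        attr = attrs.item(i)
        lines.append('%s %s = "%s"' % (element.tagName, attr.name, attr.value))
    return lines


def flat_lines(element):
    """Return 'Tag text' lines for every leaf element below element."""
    lines = []
    for child in element.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace text nodes and comments
        if any(n.nodeType == n.ELEMENT_NODE for n in child.childNodes):
            # Element with nested elements: recurse instead of printing it.
            lines.extend(flat_lines(child))
        else:
            text = "".join(
                n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE
            ).strip()
            lines.append("%s %s" % (child.tagName, text))
    return lines


doc = xml.dom.minidom.parseString(SAMPLE)
for contributor in doc.getElementsByTagName("Contributor"):
    report = attribute_lines(contributor) + flat_lines(contributor)
    print("\n".join(report))
```

When only one known attribute is needed, contributor.getAttribute('iid') is the shorter route; it returns an empty string if the attribute is absent.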

BV - Moderator
Nov 19 '10 #4
Hello,

Thanks for your help. I assembled and ran the script, but it raised an error while processing the Platform section, at line 86 of the MINiML.xml file. When I remove this line and run the script, it prints the output we want. The error output is:

    >>>

    Traceback (most recent call last):
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 45, in <module>
        print "\n".join(nodeTree(contributor))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 32, in nodeTree
        results.extend(nodeTree(node, pad+2))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 34, in nodeTree
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 15, in getTextFromElem
        textList.append(str(n.nodeValue.strip()))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 379: ordinal not in range(128)

The script is below; its line numbers match those in the traceback:
  1. #!/usr/bin/python
  2. import xml.dom.minidom
  3.  
  4. # Load the Contibutor collection
  5. MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )
  6.  
  7.  
  8. def getTextFromElem(parent):
  9.     '''Return a list of text found in the child nodes of a
  10.     parent node, discarding whitespace.'''
  11.     textList = []
  12.     for n in parent.childNodes:
  13.         # TEXT_NODE - 3
  14.         if n.nodeType == 3 and n.nodeValue.strip():
  15.             textList.append(str(n.nodeValue.strip()))
  16.     return textList
  17.  
  18. def getElemChildren(parent):
  19.     # Return a list of element nodes below parent
  20.     elements = []
  21.     for obj in parent.childNodes:
  22.         if obj.nodeType == obj.ELEMENT_NODE:
  23.             elements.append(obj)
  24.     return elements
  25.  
  26. def nodeTree(element, pad=0):
  27.     # Return list of strings representing the node tree below element
  28.     results = ["%s%s" % (pad*" ", str(element.nodeName))]
  29.     nextElems = getElemChildren(element)
  30.     if nextElems:
  31.         for node in nextElems:
  32.             results.extend(nodeTree(node, pad+2))
  33.     else:
  34.         results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
  35.     return results
  36.  
  37. contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
  38. for contributor in contributors:
  39.     print "\n".join(nodeTree(contributor))
  40. contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
  41. for contributor in contributors:
  42.     print "\n".join(nodeTree(contributor))
  43. contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
  44. for contributor in contributors:
  45.     print "\n".join(nodeTree(contributor))
  46. contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
  47. for contributor in contributors:
  48.     print "\n".join(nodeTree(contributor))
  49. contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
  50. for contributor in contributors:
  51.     print "\n".join(nodeTree(contributor))
  52.  
Regards,
Haobijam
Nov 21 '10 #5
Hello,

The output of this Python code is attached, but line 86 of the MINiML.xml file is not printed. This is an error. Please see the output.

Regards,
Haobijam
Attached Files
File Type: txt output.txt (199.8 KB, 332 views)
Nov 21 '10 #6
bvdet
I think the word "GenBank\xae" is the problem. I'm not sure what to do about that. You might try ElementTree to parse the file.
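
Some background on the traceback: u'\xae' is the registered-trademark sign, and in Python 2 calling str() on a unicode string encodes it with the ASCII codec, which cannot represent that character. Dropping the str() calls in getTextFromElem() and nodeTree(), and encoding explicitly when printing (for example line.encode('utf-8')), is one likely way around it, though I have not run that against the attached file. As a rough sketch of the ElementTree route under Python 3, where text is Unicode throughout, something like the following could work. The SAMPLE fragment, the Description tag, and the helper names local_name() and node_tree() are made up for illustration. The local_name() helper is there because ElementTree prefixes tag names with {namespace} when a document declares a default namespace, which the real file may do:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment containing the problem character (U+00AE).
SAMPLE = """<MINiML>
  <Platform>
    <Description>Annotated with GenBank\u00ae records</Description>
  </Platform>
</MINiML>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def node_tree(element, pad=0):
    """Return indented lines for element and everything below it."""
    lines = [" " * pad + local_name(element.tag)]
    children = list(element)
    if children:
        for child in children:
            lines.extend(node_tree(child, pad + 2))
    else:
        # Leaf element: element.text is None when the element is empty.
        lines.append(" " * (pad + 2) + (element.text or "").strip())
    return lines


root = ET.fromstring(SAMPLE)
platforms = [e for e in root.iter() if local_name(e.tag) == "Platform"]
for platform in platforms:
    print("\n".join(node_tree(platform)))
```

For the real file, ET.parse('MINiML.xml').getroot() would replace ET.fromstring(SAMPLE). If the output goes to a file rather than the console, opening it with an explicit encoding such as utf-8 keeps the non-ASCII characters intact.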
Nov 21 '10 #7
