Parsing tag names and values from XML files

P: 16
Sir,

Could you please help me write Python code for parsing values from XML files? I would like to extract Person, Email, Phone, Organization, Address, etc. from the XML file. I have written some code for it, but could you please correct it? Please find the attached MINiML.txt file.

    #!/usr/bin/python
    print "Content-Type: text/plain\n"
    print "<html><body>"
    import xml.dom.minidom

    # Load the Contibutor collection
    MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )

    # Get a list of Contibutors
    Contibutors = MINiML.documentElement.getElementsByTagName( 'Contibutor' )

    # Loop through the Contibutors
    for Contibutor in Contibutors:

        #Print out the Contibutor's information
        print
        print 'Email:  ' + Contibutor.getElementsByTagName ( 'Email' )[0].childNodes [0].nodeValue
        print 'Phone: ' + Contibutor.getElementsByTagName ( 'Phone' )[0].childNodes [0].nodeValue
        print 'Laboratory:  ' + Contibutor.getElementsByTagName ( 'Laboratory' )[0].childNodes [0].nodeValue
        print 'Department:  ' + Contibutor.getElementsByTagName ( 'Department' ) [0].childNodes [0].nodeValue
        print 'Organization:  ' + Contibutor.getElementsByTagName ( 'Organization' )[0].childNodes [0].nodeValue
        print "</body></html>"

Regards,
Haobijam
Attached Files
File Type: txt MINiML.txt (276.7 KB, 651 views)
Nov 18 '10 #1
6 Replies


bvdet
Expert Mod 2.5K+
P: 2,851
To begin with, you misspelled "Contributor" as "Contibutor", so getElementsByTagName() returned an empty list and the loop body never ran.

Each "Contributor" can have a varying number of ELEMENT_NODE child nodes. Some of those child nodes can have ELEMENT_NODE child nodes of their own.
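
The silent failure is easy to reproduce. Here is a minimal sketch, written for Python 3 (the code in this thread is Python 2) and using a made-up inline document rather than the attached MINiML file; the sample XML and variable names are assumptions for illustration only:

```python
import xml.dom.minidom

# Hypothetical stand-in for the attached MINiML file.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Email>someone@example.org</Email>
  </Contributor>
</MINiML>"""

root = xml.dom.minidom.parseString(SAMPLE).documentElement

# A tag name that matches nothing raises no error; it just yields an empty list.
misspelled = root.getElementsByTagName("Contibutor")
correct = root.getElementsByTagName("Contributor")

print(len(misspelled))
print(len(correct))
```

Because the empty result is not an error, a `for` loop over it simply does nothing, which is why the original script printed no contributor data.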

Note that the first child node can be a Text node with no real text value:
    >>> Contributor.childNodes[0]
    <DOM Text node "\n    ">

    >>> Contributor
    <DOM Element: Contributor at 0x1238ad0>
    >>> Contributor.childNodes[1].nodeName
    u'Organization'
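
To see the whitespace Text nodes without the attached file, here is a small Python 3 sketch. The inline XML is a hypothetical fragment (only the organization name and e-mail address are borrowed from this thread), and the variable names are mine:

```python
import xml.dom.minidom

# Hypothetical, pretty-printed fragment; the indentation becomes Text nodes.
SAMPLE = """<Contributor iid="contrib1">
    <Organization>Affymetrix, Inc.</Organization>
    <Email>support@affymetrix.com</Email>
</Contributor>"""

contributor = xml.dom.minidom.parseString(SAMPLE).documentElement

# The first child is the newline and indentation before <Organization>.
first = contributor.childNodes[0]
print(first.nodeType == first.TEXT_NODE)
print(repr(first.data))

# Filtering on nodeType keeps only the real elements.
element_names = [n.nodeName for n in contributor.childNodes
                 if n.nodeType == n.ELEMENT_NODE]
print(element_names)
```

This is why indexing `childNodes[0]` on an element from an indented file usually gives whitespace rather than the first child element.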

Try the following and look at the output. Then decide the best way to get the data you need for printing.
    for Contributor in Contributors:
        for elem in Contributor.childNodes:
            print repr(elem)
            # hasChildNodes is a method and must be called
            if elem.hasChildNodes():
                for item in elem.childNodes:
                    print "   ", repr(item)
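
One detail worth checking in a loop like this: `hasChildNodes` is a method, so it only answers the question when it is called. A short Python 3 sketch on a made-up one-line document (not the MINiML file) shows the difference:

```python
import xml.dom.minidom

doc = xml.dom.minidom.parseString("<Person><First>Yael</First></Person>")
first = doc.documentElement.firstChild   # the <First> element
text = first.firstChild                  # the Text node inside it

# Without parentheses this is a bound method object, which is always truthy.
uncalled = bool(text.hasChildNodes)

# Calling the method gives the real answer.
text_has_children = text.hasChildNodes()
element_has_children = first.hasChildNodes()

print(uncalled, text_has_children, element_has_children)
```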
Nov 18 '10 #2

P: 16
Dear,

Could you please tell me how I could parse the attributes and their values from the XML file (MINiML.txt)? I would like to print output like the following:

Contributor iid = "contrib1"
Person Yael Strulovici-Barel
Email yas2003@med.cornell.edu
Phone 646-962-5560
Laboratory Crystal
Department Department of Genetic Medicine
Organization Weill Cornell Medical College
Line 1300 York Avenue
City New York
State NY
Zip-Code 10021
Country USA

Regards,
Haobijam
Nov 19 '10 #3

bvdet
I wrote some functions for an application of mine that you may find useful, or that may give you ideas on how to format the output for your application. The first one returns a list of the text found in the child nodes of a parent node, ignoring whitespace-only text.
    def getTextFromElem(parent):
        '''Return a list of text found in the child nodes of a
        parent node, discarding whitespace.'''
        textList = []
        for n in parent.childNodes:
            # TEXT_NODE - 3
            if n.nodeType == 3 and n.nodeValue.strip():
                textList.append(str(n.nodeValue.strip()))
        return textList

The second returns a list of element nodes below a parent node.
    def getElemChildren(parent):
        # Return a list of element nodes below parent
        elements = []
        for obj in parent.childNodes:
            if obj.nodeType == obj.ELEMENT_NODE:
                elements.append(obj)
        return elements

The third returns a list of strings representing the node tree below a parent node, using recursion to reach nested levels.
    def nodeTree(element, pad=0):
        # Return list of strings representing the node tree below element
        results = ["%s%s" % (pad*" ", str(element.nodeName))]
        nextElems = getElemChildren(element)
        if nextElems:
            for node in nextElems:
                results.extend(nodeTree(node, pad+2))
        else:
            results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
        return results

Using nodeTree() in your application:
    >>> contributors = xmlDoc.documentElement.getElementsByTagName( 'Contributor' )
    >>> for contributor in contributors:
    ...     print "\n".join(nodeTree(contributor))
    ...
    Contributor
      Person
        First
          Yael
        Last
          Strulovici-Barel
      Email
        yas2003@med.cornell.edu
      Phone
        646-962-5560
      Laboratory
        Crystal
      Department
        Department of Genetic Medicine
      Organization
        Weill Cornell Medical College
      Address
        Line
          1300 York Avenue
        City
          New York
        State
          NY
        Zip-Code
          10021
        Country
          USA
    Contributor
      Organization

      Email
        geo@ncbi.nlm.nih.gov, support@affymetrix.com
      Phone
        888-362-2447
      Organization
        Affymetrix, Inc.
      Address
        City
          Santa Clara
        State
          CA
        Zip-Code
          95051
        Country
          USA
      Web-Link
        http://www.affymetrix.com/index.affx
    Contributor
      Person
        First
          Brendan
        Last
          Carolan
    Contributor
      Person
        First
          Ben-Gary
        Last
          Harvey
    Contributor
      Person
        First
          Bishnu
        Middle
          P
        Last
          De
    Contributor
      Person
        First
          Holly
        Last
          Vanni
    Contributor
      Person
        First
          Ronald
        Middle
          G
        Last
          Crystal
    >>>

Since we are not extracting attributes, I modified the title of this thread.
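
For readers who do want the attribute as well, in the one-line-per-field layout requested earlier in the thread, here is a rough Python 3 sketch (the code in this thread is Python 2). The inline XML is a cut-down, hypothetical sample, the helper names are mine, and the rule that only `Person` has its children joined onto one line is an assumption inferred from the requested output:

```python
import xml.dom.minidom

# Hypothetical, cut-down sample; the real file has more fields.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
    <Email>yas2003@med.cornell.edu</Email>
    <Address><City>New York</City><State>NY</State></Address>
  </Contributor>
</MINiML>"""

JOINED = {"Person"}  # assumption: containers whose children share one line


def element_children(node):
    """Return the element children of node, skipping whitespace Text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def own_text(node):
    """Return the non-whitespace text directly inside node."""
    return " ".join(c.data.strip() for c in node.childNodes
                    if c.nodeType == c.TEXT_NODE and c.data.strip())


def flatten(contributor):
    """Return 'name value' lines for one Contributor element."""
    lines = []
    attrs = contributor.attributes  # a NamedNodeMap
    for i in range(attrs.length):
        attr = attrs.item(i)
        lines.append('%s %s = "%s"' % (contributor.nodeName, attr.name, attr.value))
    for child in element_children(contributor):
        parts = element_children(child)
        if not parts:
            lines.append("%s %s" % (child.nodeName, own_text(child)))
        elif child.nodeName in JOINED:
            lines.append("%s %s" % (child.nodeName,
                                    " ".join(own_text(p) for p in parts)))
        else:
            lines.extend("%s %s" % (p.nodeName, own_text(p)) for p in parts)
    return lines


doc = xml.dom.minidom.parseString(SAMPLE)
contributor = doc.getElementsByTagName("Contributor")[0]
flat = flatten(contributor)
print("\n".join(flat))
```

A single attribute can also be read directly with `contributor.getAttribute("iid")`, which returns an empty string when the attribute is absent.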

BV - Moderator
Nov 19 '10 #4

P: 16
Hello,

Thanks for your help. I have assembled and run the script, but there was an error in the Platform section, at line 86 of the MINiML.xml file. When I remove that line and run the script, it prints the output we want. The error looks like this:

    >>>
    Traceback (most recent call last):
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 45, in <module>
        print "\n".join(nodeTree(contributor))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 32, in nodeTree
        results.extend(nodeTree(node, pad+2))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 34, in nodeTree
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
      File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 15, in getTextFromElem
        textList.append(str(n.nodeValue.strip()))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 379: ordinal not in range(128)

  1. #!/usr/bin/python
  2. import xml.dom.minidom
  3.  
  4. # Load the Contibutor collection
  5. MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )
  6.  
  7.  
  8. def getTextFromElem(parent):
  9.     '''Return a list of text found in the child nodes of a
  10.     parent node, discarding whitespace.'''
  11.     textList = []
  12.     for n in parent.childNodes:
  13.         # TEXT_NODE - 3
  14.         if n.nodeType == 3 and n.nodeValue.strip():
  15.             textList.append(str(n.nodeValue.strip()))
  16.     return textList
  17.  
  18. def getElemChildren(parent):
  19.     # Return a list of element nodes below parent
  20.     elements = []
  21.     for obj in parent.childNodes:
  22.         if obj.nodeType == obj.ELEMENT_NODE:
  23.             elements.append(obj)
  24.     return elements
  25.  
  26. def nodeTree(element, pad=0):
  27.     # Return list of strings representing the node tree below element
  28.     results = ["%s%s" % (pad*" ", str(element.nodeName))]
  29.     nextElems = getElemChildren(element)
  30.     if nextElems:
  31.         for node in nextElems:
  32.             results.extend(nodeTree(node, pad+2))
  33.     else:
  34.         results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
  35.     return results
  36.  
  37. contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
  38. for contributor in contributors:
  39.     print "\n".join(nodeTree(contributor))
  40. contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
  41. for contributor in contributors:
  42.     print "\n".join(nodeTree(contributor))
  43. contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
  44. for contributor in contributors:
  45.     print "\n".join(nodeTree(contributor))
  46. contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
  47. for contributor in contributors:
  48.     print "\n".join(nodeTree(contributor))
  49. contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
  50. for contributor in contributors:
  51.     print "\n".join(nodeTree(contributor))
  52.  
Regards,
Haobijam
Nov 21 '10 #5

P: 16
Hello,

The output of this Python code is attached, but line 86 of the MINiML.xml file is not printed. This is an error. Please see the output.

Regards,
Haobijam
Attached Files
File Type: txt output.txt (199.8 KB, 267 views)
Nov 21 '10 #6

bvdet
I think the word "GenBank\xae" is the problem. I'm not sure what to do about that. You might try ElementTree to parse the file.
Nov 21 '10 #7
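
Some background on the traceback: u'\xae' is the registered-trademark sign (®), and in Python 2 calling str() on a unicode string encodes it with the ASCII codec, which cannot represent that character. Dropping the str() call, or encoding explicitly (for example with .encode('utf-8')), should avoid the error in the Python 2 script. As a rough sketch of the ElementTree route suggested above, here is a Python 3 version of the tree printer. The inline XML, the namespace URI, and the function names are all made up for illustration; if the real file declares a default namespace, ElementTree reports tags as `{uri}Name`, which the helper strips:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; "\u00ae" is the character named in the traceback.
SAMPLE = """<MINiML xmlns="http://example.org/miniml">
  <Platform iid="platform1">
    <Description>Sequences from GenBank\u00ae and dbEST</Description>
  </Platform>
</MINiML>"""


def local_name(tag):
    """Drop a '{namespace}' prefix, if any, from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def node_tree(element, pad=0):
    """Return indented lines for the element tree below element."""
    lines = [" " * pad + local_name(element.tag)]
    children = list(element)
    if children:
        for child in children:
            lines.extend(node_tree(child, pad + 2))
    else:
        # Text stays a Unicode string; no str() or ASCII conversion is involved.
        lines.append(" " * (pad + 2) + (element.text or "").strip())
    return lines


root = ET.fromstring(SAMPLE)
platform = root.find("{http://example.org/miniml}Platform")
lines = node_tree(platform)
print("\n".join(lines))
```

In Python 3 all text is Unicode, so the same approach with minidom would also avoid the ASCII error; ElementTree mainly makes the traversal shorter because whitespace between elements is not exposed as separate child nodes.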
