Parsing tag names and values from XML files

Sir,

Could you please help me write Python code to parse values from XML files? I would like to extract Person, Email, Phone, Organization, Address, etc. from the XML file. I have written some code for it, but could you please correct it? Please find the attached MINiML.txt file.

#!/usr/bin/python

print "Content-Type: text/plain\n"
print "<html><body>"

import xml.dom.minidom

# Load the Contibutor collection
MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )

# Get a list of Contibutors
Contibutors = MINiML.documentElement.getElementsByTagName( 'Contibutor' )

# Loop through the Contibutors
for Contibutor in Contibutors:

    #Print out the Contibutor's information
    print
    print 'Email:  ' + Contibutor.getElementsByTagName ( 'Email' )[0].childNodes [0].nodeValue
    print 'Phone: ' + Contibutor.getElementsByTagName ( 'Phone' )[0].childNodes [0].nodeValue
    print 'Laboratory:  ' + Contibutor.getElementsByTagName ( 'Laboratory' )[0].childNodes [0].nodeValue
    print 'Department:  ' + Contibutor.getElementsByTagName ( 'Department' ) [0].childNodes [0].nodeValue
    print 'Organization:  ' + Contibutor.getElementsByTagName ( 'Organization' )[0].childNodes [0].nodeValue
    print "</body></html>"

Regards,
Haobijam

Attached Files

MINiML.txt (276.7 KB, 752 views)

Nov 18 '10 #1

bvdet

2,851

Expert Mod 2GB

To begin with, you misspelled "Contributor" as "Contibutor". Because no element has that name, getElementsByTagName() returned an empty list and the loop body never ran.
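
This failure mode is easy to reproduce on a small sample. The following is a minimal Python 3 sketch; the inline XML is made up for illustration and is not taken from MINiML.txt. It shows that getElementsByTagName() does not raise an error for an unknown tag name, it simply returns nothing to loop over.

```python
from xml.dom import minidom

# Hypothetical sample; the real MINiML file is much larger.
SAMPLE = "<MINiML><Contributor><Email>a@example.org</Email></Contributor></MINiML>"

doc = minidom.parseString(SAMPLE)
misspelled = doc.documentElement.getElementsByTagName("Contibutor")
correct = doc.documentElement.getElementsByTagName("Contributor")

print(len(misspelled))  # no exception, just an empty result
print(len(correct))
```

Checking the length of the result before looping is a cheap way to catch this kind of typo.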

Each "Contributor" can have a varying number of ELEMENT_NODE child nodes. Some of those child nodes can have ELEMENT_NODE child nodes of their own.

Note that the first child node can be a Text node with no real text value:

>>> Contributor.childNodes[0]
<DOM Text node "\n    ">
>>>

>>> Contributor
<DOM Element: Contributor at 0x1238ad0>
>>> Contributor.childNodes[1].nodeName
u'Organization'
>>>

Try the following and look at the output. Then decide the best way to get the data you need for printing.

for Contributor in Contributors:
    for elem in Contributor.childNodes:
        print repr(elem)
        # hasChildNodes is a method; without the call the test is always true
        if elem.hasChildNodes():
            for item in elem.childNodes:
                print "   ", repr(item)
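
To see the whitespace Text nodes without the full file, the same kind of walk can be run on a small pretty-printed sample. This is a Python 3 sketch (print is a function there), and the XML snippet is invented for illustration. Filtering on nodeType is one way to keep only the element children.

```python
from xml.dom import minidom

# Invented sample shaped like a MINiML Contributor entry.
SAMPLE = """<MINiML>
  <Contributor iid="contrib1">
    <Organization>Weill Cornell Medical College</Organization>
    <Email>yas2003@med.cornell.edu</Email>
  </Contributor>
</MINiML>"""

doc = minidom.parseString(SAMPLE)
contributor = doc.getElementsByTagName("Contributor")[0]

# The first child is the newline and indentation before <Organization>.
first = contributor.childNodes[0]
print(first.nodeType == first.TEXT_NODE, repr(first.nodeValue))

# Keep only element children, skipping the whitespace Text nodes.
element_names = [n.nodeName for n in contributor.childNodes
                 if n.nodeType == n.ELEMENT_NODE]
print(element_names)
```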

Nov 18 '10 #2

haobijam

Dear,

Could you please tell me how I can parse the attributes and their values from the XML file (MINiML.txt)? I would like to print output like the following:

Contributor iid = "contrib1"
Person Yael Strulovici-Barel
Email yas2003@med.cornell.edu
Phone 646-962-5560
Laboratory Crystal
Department Department of Genetic Medicine
Organization Weill Cornell Medical College
Line 1300 York Avenue
City New York
State NY
Zip-Code 10021
Country USA

Regards,
Haobijam

Nov 19 '10 #3

bvdet

I wrote some functions for an application of mine that you may find useful or give you ideas on how to format the output for your application. The first one returns a list of text found in the child nodes of a parent node. Whitespace is ignored.

def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

The second returns a list of element nodes below a parent node.

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

The third returns a list of strings representing the node tree below a parent node, using recursion to reach nested levels.

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results
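
For readers on Python 3, the three helpers might be ported roughly as follows. This is a sketch, not the code from the original application: the snake_case names and the inline sample XML are assumptions made for illustration, and the str() calls are dropped because Python 3 strings are already Unicode.

```python
from xml.dom import minidom

def get_text_from_elem(parent):
    """Return stripped text of the direct text children, skipping whitespace."""
    return [n.nodeValue.strip() for n in parent.childNodes
            if n.nodeType == n.TEXT_NODE and n.nodeValue.strip()]

def get_elem_children(parent):
    """Return the element nodes directly below parent."""
    return [n for n in parent.childNodes if n.nodeType == n.ELEMENT_NODE]

def node_tree(element, pad=0):
    """Return indented lines describing the tree below element."""
    results = [" " * pad + element.nodeName]
    children = get_elem_children(element)
    if children:
        for child in children:
            results.extend(node_tree(child, pad + 2))
    else:
        # Leaf element: list its text two columns deeper than its name.
        results.append(" " * (pad + 2) + ", ".join(get_text_from_elem(element)))
    return results

# Invented sample in the shape of a MINiML Contributor entry.
SAMPLE = """<Contributor>
  <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
  <Phone>646-962-5560</Phone>
</Contributor>"""

lines = node_tree(minidom.parseString(SAMPLE).documentElement)
print("\n".join(lines))
```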

Using nodeTree() in your application:

>>> contributors = xmlDoc.documentElement.getElementsByTagName( 'Contributor' )
>>> for contributor in contributors:
...     print "\n".join(nodeTree(contributor))
...
Contributor
  Person
    First
      Yael
    Last
      Strulovici-Barel
  Email
    yas2003@med.cornell.edu
  Phone
    646-962-5560
  Laboratory
    Crystal
  Department
    Department of Genetic Medicine
  Organization
    Weill Cornell Medical College
  Address
    Line
      1300 York Avenue
    City
      New York
    State
      NY
    Zip-Code
      10021
    Country
      USA
Contributor
  Organization

  Email
    geo@ncbi.nlm.nih.gov, support@affymetrix.com
  Phone
    888-362-2447
  Organization
    Affymetrix, Inc.
  Address
    City
      Santa Clara
    State
      CA
    Zip-Code
      95051
    Country
      USA
  Web-Link
    http://www.affymetrix.com/index.affx
Contributor
  Person
    First
      Brendan
    Last
      Carolan
Contributor
  Person
    First
      Ben-Gary
    Last
      Harvey
Contributor
  Person
    First
      Bishnu
    Middle
      P
    Last
      De
Contributor
  Person
    First
      Holly
    Last
      Vanni
Contributor
  Person
    First
      Ronald
    Middle
      G
    Last
      Crystal
>>>

Since we are not extracting attributes, I modified the title of this thread.
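
If the iid attribute and the flat "name value" layout requested earlier are wanted after all, something like the following Python 3 sketch could serve as a starting point. getAttribute() is the standard minidom call for reading an attribute; the helper names, the special handling of Person, and the inline sample are assumptions made for illustration.

```python
from xml.dom import minidom

def elem_children(node):
    """Return the element nodes directly below node."""
    return [n for n in node.childNodes if n.nodeType == n.ELEMENT_NODE]

def text_of(node):
    """Join the stripped text found directly inside node."""
    return " ".join(n.nodeValue.strip() for n in node.childNodes
                    if n.nodeType == n.TEXT_NODE and n.nodeValue.strip())

def flat_lines(contributor):
    """Return 'name value' lines for one Contributor element."""
    lines = ['Contributor iid = "%s"' % contributor.getAttribute("iid")]
    for child in elem_children(contributor):
        parts = elem_children(child)
        if child.nodeName == "Person":
            # Assumption: the name parts are joined on one line.
            lines.append("Person " + " ".join(text_of(p) for p in parts))
        elif parts:
            # Other nested elements (e.g. Address): one line per sub-element.
            lines.extend("%s %s" % (p.nodeName, text_of(p)) for p in parts)
        else:
            lines.append("%s %s" % (child.nodeName, text_of(child)))
    return lines

# Invented sample in the shape of a MINiML Contributor entry.
SAMPLE = """<Contributor iid="contrib1">
  <Person><First>Yael</First><Last>Strulovici-Barel</Last></Person>
  <Email>yas2003@med.cornell.edu</Email>
  <Address><City>New York</City><State>NY</State></Address>
</Contributor>"""

flat = flat_lines(minidom.parseString(SAMPLE).documentElement)
print("\n".join(flat))
```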

BV - Moderator

Nov 19 '10 #4

haobijam

Hello,

Thanks for your help. I have assembled and run the script, but there was an error while processing the Platform section at line 86 of the MINiML.xml file. When I remove this line and run the script, it prints the output we want correctly. The error output looks like this:

>>>

Traceback (most recent call last):
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 45, in <module>
    print "\n".join(nodeTree(contributor))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 32, in nodeTree
    results.extend(nodeTree(node, pad+2))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 34, in nodeTree
    results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
  File "C:\Users\haojam\Desktop\GEO\GSE10006\test2.py", line 15, in getTextFromElem
    textList.append(str(n.nodeValue.strip()))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 379: ordinal not in range(128)

#!/usr/bin/python
import xml.dom.minidom

# Load the Contributor collection
MINiML = xml.dom.minidom.parse ( 'MINiML.xml' )

def getTextFromElem(parent):
    '''Return a list of text found in the child nodes of a
    parent node, discarding whitespace.'''
    textList = []
    for n in parent.childNodes:
        # TEXT_NODE - 3
        if n.nodeType == 3 and n.nodeValue.strip():
            textList.append(str(n.nodeValue.strip()))
    return textList

def getElemChildren(parent):
    # Return a list of element nodes below parent
    elements = []
    for obj in parent.childNodes:
        if obj.nodeType == obj.ELEMENT_NODE:
            elements.append(obj)
    return elements

def nodeTree(element, pad=0):
    # Return list of strings representing the node tree below element
    results = ["%s%s" % (pad*" ", str(element.nodeName))]
    nextElems = getElemChildren(element)
    if nextElems:
        for node in nextElems:
            results.extend(nodeTree(node, pad+2))
    else:
        results.append("%s%s" % ((pad+2)*" ", ", ".join(getTextFromElem(element))))
    return results

contributors = MINiML.documentElement.getElementsByTagName( 'Contributor' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

contributors = MINiML.documentElement.getElementsByTagName( 'Database' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

contributors = MINiML.documentElement.getElementsByTagName( 'Platform' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

contributors = MINiML.documentElement.getElementsByTagName( 'Sample' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

contributors = MINiML.documentElement.getElementsByTagName( 'Series' )
for contributor in contributors:
    print "\n".join(nodeTree(contributor))

Regards,
Haobijam

Nov 21 '10 #5

haobijam

Hello,

The output of this Python code is attached here, but line 86 of the MINiML.xml file is not printed. This is an error. Please see the output.

Regards,
Haobijam

Attached Files

output.txt (199.8 KB, 332 views)

Nov 21 '10 #6

bvdet

I think the word "GenBank\xae" is the problem. I'm not sure what to do about that. You might try ElementTree to parse the file.
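
For context, u'\xae' is the registered-trademark sign (®). Under Python 2, calling str() on a unicode string encodes it with the ASCII codec, which is the failure the traceback reports. A likely way around it, offered as an assumption that has not been tried against the real file, is to drop the str() calls in getTextFromElem() and nodeTree() and encode explicitly (for example to UTF-8) only when printing. The Python 3 sketch below reproduces the encoding failure and shows the ElementTree route on an invented sample. If the real file declares a default XML namespace, ElementTree tag names would also need the {namespace} prefix.

```python
import xml.etree.ElementTree as ET

text = "GenBank\xae"  # same character as in the traceback

# Encoding to ASCII fails, which is what Python 2's str() did implicitly.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Encoding explicitly to UTF-8 succeeds.
utf8_bytes = text.encode("utf-8")

# Invented sample; the element names mimic MINiML but omit any namespace.
SAMPLE = "<Platform><Description>GenBank\xae record</Description></Platform>"
root = ET.fromstring(SAMPLE)
descriptions = [el.text.strip() for el in root.iter("Description")]

print(ascii_ok, utf8_bytes, descriptions)
```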

Nov 21 '10 #7
