
extracting from an XML file

Hello sir,
My aim is to extract the 'id' and 'ac' attributes from the given XML files and store the results in two different files. The code I wrote can extract the ids and write them to a file, but I can't extract 'ac'. I want to extract all values of ac; e.g. for
ac="Q708T3", the output file should contain only Q708T3.
Kindly provide a solution.

The input file (i.e. the XML) is as follows:

<?xml version="1.0" ?>
<EBIApplicationResult xmlns="http://www.ebi.ac.uk/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://www.ebi.ac.uk/schema/ApplicationResult.xsd">
  <Header>
    <program name="WU-blastp" version="2.0MP-WashU [01-Jan-2006]" citation="PMID:12824421" />
    <parameters>
      <sequences total="1">
        <sequence number="1" name="Sequence" type="p" length="149" />
      </sequences>
      <databases total="1" sequences="241242" letters="88541632">
        <database number="1" name="swissprot" type="p" created="2006-10-29T23:34:03+00:00" />
      </databases>
      <scores>100</scores>
      <alignments>50</alignments>
      <matrix>BLOSUM62</matrix>
      <expectationUpper>10</expectationUpper>
      <statistics>sump</statistics>
    </parameters>
    <timeInfo start="2006-10-31T21:11:01+00:00" end="2006-10-31T21:11:03+00:00" search="PT02S" />
  </Header>
  <SequenceSimilaritySearchResult>
    <hits total="31">
      <hit number="1" database="swissprot" id="MT_PODSI" ac="Q708T3" length="63" description="Metallothionein (MT).">
        <alignments total="2">
          <alignment number="1">
            <score>48</score>
            <bits>22.0</bits>
            <expectation>0.051</expectation>
            <probability>0.050</probability>
            <identity>40</identity>
            <positives>40</positives>
            <querySeq start="15" end="34">NCHITINASECCLCCL--CCLC</querySeq>
            <pattern>NC T CC CC C C</pattern>
            <matchSeq start="24" end="45">NCKCTSCKKSCCSCCPAGCAKC</matchSeq>
          </alignment>
          <alignment number="2">
            <score>33</score>
            <bits>16.7</bits>
            <expectation>0.051</expectation>
            <probability>0.050</probability>
            <identity>45</identity>
            <positives>54</positives>
            <querySeq start="58" end="68">RCNTFCXCLEP</querySeq>
            <pattern>+C C C EP</pattern>
            <matchSeq start="44" end="54">KCAKSCVCKEP</matchSeq>
          </alignment>
        </alignments>
      </hit>
      <hit number="2" database="swissprot" id="IBB4_DOLAX" ac="P01059" length="76" description="Bowman-Birk type proteinase inhibitor DE-4.">
        <alignments total="1">
          <alignment number="1">
            <score>62</score>
            <bits>26.9</bits>
            <expectation>0.19</expectation>
            <probability>0.18</probability>
            <identity>27</identity>
            <positives>44</positives>
            <querySeq start="2" end="36">CIDICMAMMALIANCHIT-INASECCLCCLCCLCIL</querySeq>
            <pattern>C D+C ++ CH + + C C C+C L</pattern>
            <matchSeq start="15" end="50">CCDLCTCTKSIPPQCHCNDMRLNSCHSACKSCICAL</matchSeq>
          </alignment>
        </alignments>
      </hit>
      <hit number="3" database="swissprot" id="IBBC2_SOYBN" ac="P01063" length="83" description="Bowman-Birk type proteinase inhibitor C-II precursor.">
        <alignments total="1">
          <alignment number="1">
            <score>61</score>
            <bits>26.5</bits>
            <expectation>0.26</expectation>
            <probability>0.23</probability>
            <identity>32</identity>
            <positives>44</positives>
            <querySeq start="2" end="34">CIDICMAMMALIANCHIT-INASECCLCCLCCLC</querySeq>
            <pattern>C D+CM ++ CH I + C C C C</pattern>
            <matchSeq start="21" end="54">CCDLCMCTASMPPQCHCADIRLNSCHSACDRCAC</matchSeq>
          </alignment>
        </alignments>
      </hit>
etc...


The code I wrote is:
[code]
import java.io.*;
import java.lang.*;
import java.util.*;
import java.sql.*;

public class NameHandler
{
    public static void main(String[] args)
    {
        new NameHandler().runProgram();
    }

    public void runProgram()
    {
        try
        {
            PrintWriter pw1 = new PrintWriter(new FileWriter("outIDS.txt"));
            String line = "";

            String swissprot = "swissprot";
            BufferedReader br1 = new BufferedReader(new FileReader("blast-20061031-21110099.xml"));
            int i = 0;
            while ((line = br1.readLine()) != null)
            {
                if (line.startsWith(" <hit number"))
                {
                    i++;
                    if (i <= 10)
                    {
                        String eleminate = " <hit number=" + i + "database=" + "swissprot" + " " + "id=" + "\"";

                        String valuefrom = new NameHandler().getElement(line, eleminate);

                        String trimmed = valuefrom.trim();
                        pw1.println(trimmed);
                    }
                }
            }
            pw1.flush();
            pw1.close();
        }
        catch (Exception e)
        {}
    }

    public String getElement(String line, String tagName)
    {
        int length = tagName.length();
        line = line.substring(length);

        String value = "";

        System.out.println("index=" + length);

        value = line.substring(5, line.lastIndexOf(" ac") - 1);

        return value;
    }
}
[/code]
Nov 15 '06 #1
1 Reply


In your XML file:
1) Is it always the case that id and ac occur on lines starting with <hit number=...?
2) Does ac always appear immediately after id?
3) Did you say you can get the ids just fine?
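
If every hit really does sit on a single line like the ones in your sample, one option is to stop counting character offsets and match each attribute by name instead. The following is only a rough sketch under that assumption, and it needs Java 8 or later. The class name HitAttributes, the helper attribute() and the second output file outACS.txt are made up for illustration; the input file name and outIDS.txt are taken from your code. Unlike your version, it writes every hit rather than only the first ten.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HitAttributes {

    // Returns the value of name="value" found in the line, or null if the
    // attribute is not present. Assumes double-quoted attribute values.
    public static String attribute(String line, String name) {
        Pattern p = Pattern.compile("\\s" + Pattern.quote(name) + "=\"([^\"]*)\"");
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) throws IOException {
        // outACS.txt is a hypothetical name for the second output file.
        try (BufferedReader in = Files.newBufferedReader(Paths.get("blast-20061031-21110099.xml"));
             PrintWriter ids = new PrintWriter(Files.newBufferedWriter(Paths.get("outIDS.txt")));
             PrintWriter acs = new PrintWriter(Files.newBufferedWriter(Paths.get("outACS.txt")))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Only <hit ...> lines are of interest; "<hits" does not match "<hit ".
                if (!line.trim().startsWith("<hit ")) {
                    continue;
                }
                String id = attribute(line, "id");
                String ac = attribute(line, "ac");
                if (id != null) {
                    ids.println(id);
                }
                if (ac != null) {
                    acs.println(ac);
                }
            }
        }
    }
}
```

Matching by attribute name means the order of id and ac on the line no longer matters. If a hit element can ever be split across several lines, a line-based approach like this will miss it, and a real XML parser would be the safer choice.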
Nov 15 '06 #2
