Bytes IT Community

Perl script to join tab-delimited file elements

Hi,

The first part of my script works fine. Basically, the script reads a file of IDs that I want to look up in a flat-file database and pulls the matching information. I set up a tilde-delimited output file and then read in the two tab-delimited files, located here:
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGD_features.tab

ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/go_slim_mapping.tab

The SGD_features.tab part of the script works great and it does everything that I need it to. However, the go_slim_mapping.tab part of the script is where I am having some trouble. If you check out the link, you can see that there are multiple rows of the same SGDID.

What I need to do is look up SGDIDs and reference the go_aspect term (three choices: F, C, P) and the associated description under go_slim. If an SGDID has multiple rows for the same go_aspect, the go_slim entries for that aspect (let's say C) should be joined together with a | delimiter.

If you look up S000004664 in the linked file, you can see about 8 rows with the same SGDID, but each row has one go_aspect letter and one go_slim description. For this example, let's say it takes the C values and combines them so the result looks like this:

cytoplasm|membrane|mitochondrial envelope|mitochondrion

which will then be placed in the CSV file for that particular SGDID under the Cellular Component column. I need the same process for the F and P values and their respective columns, Molecular Function and Biological Process.

If you don't understand what I am talking about, please ask and I'll try to explain it again.

Thanks for the help,
Hans

    #!/usr/bin/perl
    use strict;
    use warnings;

    # IDs to look up, one SGDID per line
    open(IDS, '<', 'partsgdids.txt') or die "Cannot open partsgdids.txt: $!\n";
    chomp(my @ids = <IDS>);
    close(IDS);

    ## Tilde-delimited output file
    open(MYFILE, '>', 'data.csv') or die "Cannot open data.csv: $!\n";
    print MYFILE "SGDID~ORF~Standard_Name~Alias~Description~Name_Description~Molecular_Function~Biological_Process~Cellular_Component~Define~Mutant_Phenotype\n";

    open(SGDFEAT, '<', 'SGD_features.tab') or die "Cannot open SGD_features.tab: $!\n";
    chomp(my @sgdfeats = <SGDFEAT>);
    close(SGDFEAT);

    open(SLIMMAP, '<', 'go_slim_mapping.tab') or die "Cannot open go_slim_mapping.tab: $!\n";
    chomp(my @slim = <SLIMMAP>);
    close(SLIMMAP);

    ## List of columns: $sgdid, $feat_type, $feat_qual, $feat_name, $stnd_name, $alias,
    ## $parent, $sec_sgdid, $chrom, $start_coord, $stop_coord, $strand, $genetic_pos,
    ## $coord_ver, $seq_vers, $desc

    # One lookup hash per column, keyed by SGDID
    my %feat_type = ();
    my %feat_qual = ();
    my %feat_name = ();
    my %stnd_name = ();
    my %desc = ();
    my %alias = ();
    my %parent = ();
    my %sec_sgdid = ();
    my %chrom = ();
    my %start_coord = ();
    my %stop_coord = ();
    my %strand = ();
    my %genetic_pos = ();
    my %coord_ver = ();
    my %seq_vers = ();

    foreach my $i (@sgdfeats) {
        my ($sgdid, $feat_type, $feat_qual, $feat_name, $stnd_name, $alias, $parent, $sec_sgdid, $chrom, $start_coord, $stop_coord, $strand, $genetic_pos, $coord_ver, $seq_vers, $desc) = split(/\t/, $i);
        $feat_type{$sgdid} = $feat_type;
        $feat_qual{$sgdid} = $feat_qual;
        $feat_name{$sgdid} = $feat_name;
        $stnd_name{$sgdid} = $stnd_name;
        $desc{$sgdid} = $desc;
        $alias{$sgdid} = $alias;
        $parent{$sgdid} = $parent;
        $sec_sgdid{$sgdid} = $sec_sgdid;
        $chrom{$sgdid} = $chrom;
        $start_coord{$sgdid} = $start_coord;
        $stop_coord{$sgdid} = $stop_coord;
        $strand{$sgdid} = $strand;
        $genetic_pos{$sgdid} = $genetic_pos;
        $coord_ver{$sgdid} = $coord_ver;
        $seq_vers{$sgdid} = $seq_vers;
    }

    ## List of columns for go_slim: $orf, $gene, $sgdid, $go_aspect, $go_slim, $goid, $feature_type

    ## NEED HELP HERE
    my %orf = ();
    my %gene = ();
    #my %sgdid = ();
    my %go_aspect = ();
    my %go_slim = ();
    my %goid = ();
    my %feature_type = ();

    foreach my $p (@slim) {
        my ($orf, $gene, $sgdid, $go_aspect, $go_slim, $goid, $feature_type) = split(/\t/, $p);
        # Plain assignment keeps only the last row seen for each SGDID,
        # which is why these lines are commented out.
        #$orf{$sgdid} = $orf;
        #$gene{$sgdid} = $gene;
        #$go_aspect{$sgdid} = $go_aspect;
        #$go_slim{$sgdid} = $go_slim;
        #$goid{$sgdid} = $goid;
        #$feature_type{$sgdid} = $feature_type;
    }

    # Only the SGD_features.tab columns are written so far
    foreach my $id (@ids) {
        print MYFILE "$id~$feat_name{$id}~$stnd_name{$id}~$alias{$id}~$desc{$id}~\n";
    }
    close(MYFILE);

Oct 8 '08 #1
1 Reply


KevinADC
Looking at the lines you mentioned:

    YMR060C    SAM37    S000004664    C    cytoplasm    GO:0005737    ORF|Verified
    YMR060C    SAM37    S000004664    C    membrane    GO:0016020    ORF|Verified
    YMR060C    SAM37    S000004664    C    mitochondrial envelope    GO:0005740    ORF|Verified
    YMR060C    SAM37    S000004664    C    mitochondrion    GO:0005739    ORF|Verified
    YMR060C    SAM37    S000004664    F    protein binding    GO:0005515    ORF|Verified
    YMR060C    SAM37    S000004664    P    anatomical structure morphogenesis    GO:0009653    ORF|Verified
    YMR060C    SAM37    S000004664    P    membrane organization    GO:0016044    ORF|Verified
    YMR060C    SAM37    S000004664    P    organelle organization    GO:0006996    ORF|Verified

From the code you posted, I take it that the SGDID is the third field: S000004664

And the lines appear to have 7 fields of data but the fifth field can have spaces in the data, for example: "mitochondrial envelope".
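To illustrate the point about embedded spaces, here is a minimal sketch. It assumes the fields really are separated by single tab characters, as the file name suggests; the sample row is rebuilt with explicit tabs because the listing above shows them as spaces. Splitting on tabs only should leave "mitochondrial envelope" intact as one field.

```perl
use strict;
use warnings;

# One sample row, rebuilt with explicit tabs (an assumption about the real file).
my $line = join("\t", 'YMR060C', 'SAM37', 'S000004664', 'C',
    'mitochondrial envelope', 'GO:0005740', 'ORF|Verified');

# Split on tabs only, so spaces inside a field are preserved.
my @fields = split(/\t/, $line);

print scalar(@fields), " fields\n";
print "SGDID:   $fields[2]\n";
print "go_slim: $fields[4]\n";
```

Splitting on general whitespace (for example `split ' ', $line`) would break the fifth field into two pieces, so the tab pattern matters here.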

Is what I have said correct?
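If so, one possible way to get the joined values is a hash of hashes of arrays keyed by SGDID and then by go_aspect. The following is only a sketch under that assumption: the sample rows are held in memory rather than read from go_slim_mapping.tab, and the names `%slim_terms` and `joined_terms` are made up for illustration and do not appear in the posted script.

```perl
use strict;
use warnings;

# Sample rows for S000004664, stored as tab-joined strings the way
# chomp(my @slim = <SLIMMAP>) would leave them.
my @rows = (
    [ 'YMR060C', 'SAM37', 'S000004664', 'C', 'cytoplasm',              'GO:0005737', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'C', 'membrane',               'GO:0016020', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'C', 'mitochondrial envelope', 'GO:0005740', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'C', 'mitochondrion',          'GO:0005739', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'F', 'protein binding',        'GO:0005515', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'P', 'anatomical structure morphogenesis', 'GO:0009653', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'P', 'membrane organization',  'GO:0016044', 'ORF|Verified' ],
    [ 'YMR060C', 'SAM37', 'S000004664', 'P', 'organelle organization', 'GO:0006996', 'ORF|Verified' ],
);
my @slim = map { join("\t", @$_) } @rows;

# $slim_terms{$sgdid}{$go_aspect} holds an array ref of go_slim terms,
# so repeated rows accumulate instead of overwriting each other.
my %slim_terms;
foreach my $p (@slim) {
    my ($orf, $gene, $sgdid, $go_aspect, $go_slim) = split(/\t/, $p);
    next unless defined $go_slim;    # skip short or blank lines
    push @{ $slim_terms{$sgdid}{$go_aspect} }, $go_slim;
}

# Return the terms for one SGDID and aspect joined with "|",
# or an empty string when there are none.
sub joined_terms {
    my ($sgdid, $aspect) = @_;
    my $terms = $slim_terms{$sgdid}{$aspect} or return '';
    return join('|', @$terms);
}

foreach my $aspect (qw(F P C)) {
    print "$aspect: ", joined_terms('S000004664', $aspect), "\n";
}
```

For the sample rows this prints "protein binding" for F, the three P terms joined with |, and "cytoplasm|membrane|mitochondrial envelope|mitochondrion" for C. In the posted script, the three joined strings could then be appended to the output line in the Molecular_Function, Biological_Process and Cellular_Component positions.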
Oct 9 '08 #2
