Bytes | Software Development & Data Engineering Community
Perl script to join tab-delimited file elements

Hi,

The first part of my script works fine. Basically, the script reads a file of IDs that I want to look up in a flat-file database and pull information for. I set up a tilde-delimited output file and then read in the two tab-delimited files, located here:
ftp://genome-ftp.stanford.edu/pub/yeast/data_download/chromosomal_feature/SGD_features.tab

ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/go_slim_mapping.tab

The SGD_features.tab part of the script works great and it does everything that I need it to. However, the go_slim_mapping.tab part of the script is where I am having some trouble. If you check out the link, you can see that there are multiple rows of the same SGDID.

What I need to do is look up each SGDID and reference its go_aspect term (three choices: F, C, P) and the associated description under go_slim. If an SGDID has multiple entries for the same go_aspect, the entries for that aspect (let's say C) should be joined together with a | delimiter.

If you look up S000004664 in the linked file, you can see about 8 rows with the same SGDID, but each row has one go_aspect letter and one go_slim definition. So for this example, let's say it takes the C values and combines them so the result looks like this:

cytoplasm|membrane|mitochondrial envelope|mitochondrion

which will then be placed in the csv file for that particular SGDID under the Cellular Component column. I would need the same process to be done for the F and P values and their respective columns, Molecular Function and Biological Process.

If you don't understand what I am talking about, please ask and I'll try to explain it again.

Thanks for the help,
Hans

    #!/usr/bin/perl
    use strict;
    use warnings;

    ## IDs to look up, one per line
    open (IDS, "<partsgdids.txt") || die "Cannot open partsgdids.txt: $!\n";
    chomp (my @ids = <IDS> );
    close (IDS);

    ## Tilde-delimited output file
    open (MYFILE, '>data.csv') || die "Cannot open data.csv: $!\n";
    print MYFILE "SGDID~ORF~Standard_Name~Alias~Description~Name_Description~Molecular_Function~Biological_Process~Cellular_Component~Define~Mutant_Phenotype\n";

    open (SGDFEAT, "SGD_features.tab") || die "File not found\n";
    chomp (my @sgdfeats=<SGDFEAT>);
    close (SGDFEAT);

    open (SLIMMAP, "go_slim_mapping.tab") || die "File not found\n";
    chomp (my @slim=<SLIMMAP>);
    close (SLIMMAP);


    ##List of columns $sgdid, $feat_type, $feat_qual, $feat_name, $stnd_name, $alias,
    ##$parent, $sec_sgdid, $chrom, $start_coord, $stop_coord, $strand, $genetic_pos,
    ##$coord_ver, $seq_vers, $desc

    ## One hash per column, each keyed by SGDID
    my %feat_type = ();
    my %feat_qual = ();
    my %feat_name = ();
    my %stnd_name = ();
    my %desc = ();
    my %alias = ();
    my %parent = ();
    my %sec_sgdid = ();
    my %chrom = ();
    my %start_coord = ();
    my %stop_coord = ();
    my %strand = ();
    my %genetic_pos = ();
    my %coord_ver = ();
    my %seq_vers = ();

    foreach my $i (@sgdfeats) {
        my ($sgdid, $feat_type, $feat_qual, $feat_name, $stnd_name, $alias, $parent, $sec_sgdid, $chrom, $start_coord, $stop_coord, $strand, $genetic_pos, $coord_ver, $seq_vers, $desc) = split(/\t/, $i);
        $feat_type{$sgdid} = $feat_type;
        $feat_qual{$sgdid} = $feat_qual;
        $feat_name{$sgdid} = $feat_name;
        $stnd_name{$sgdid} = $stnd_name;
        $desc{$sgdid} = $desc;
        $alias{$sgdid} = $alias;
        $parent{$sgdid} = $parent;
        $sec_sgdid{$sgdid} = $sec_sgdid;
        $chrom{$sgdid} = $chrom;
        $start_coord{$sgdid} = $start_coord;
        $stop_coord{$sgdid} = $stop_coord;
        $strand{$sgdid} = $strand;
        $genetic_pos{$sgdid} = $genetic_pos;
        $coord_ver{$sgdid} = $coord_ver;
        $seq_vers{$sgdid} = $seq_vers;
    }


    ##List of Columns for go_slim: $orf, $gene, $sgdid, $go_aspect, $go_slim, $goid, $feature_type

    ##NEED HELP HERE
    my %orf = ();
    my %gene = ();
    #my %sgdid = ();
    my %go_aspect = ();
    my %go_slim = ();
    my %goid = ();
    my %feature_type = ();

    foreach my $p (@slim) {
        my ($orf, $gene, $sgdid, $go_aspect, $go_slim, $goid, $feature_type) = split(/\t/, $p);
        #$orf{$sgdid} = $orf;
        #$gene{$sgdid} = $gene;
        #$go_aspect{$sgdid} = $go_aspect;
        #$go_slim{$sgdid} = $go_slim;
        #$goid{$sgdid} = $goid;
        #$feature_type{$sgdid} = $feature_type;
    }


    ## One output row per requested ID
    foreach my $ids (@ids) {
        print MYFILE "$ids~$feat_name{$ids}~$stnd_name{$ids}~$alias{$ids}~$desc{$ids}~\n";
    }
    close (MYFILE);

Oct 8 '08 #1
KevinADC
4,059 Expert 2GB
Looking at the lines you mentioned:

    YMR060C    SAM37    S000004664    C    cytoplasm    GO:0005737    ORF|Verified
    YMR060C    SAM37    S000004664    C    membrane    GO:0016020    ORF|Verified
    YMR060C    SAM37    S000004664    C    mitochondrial envelope    GO:0005740    ORF|Verified
    YMR060C    SAM37    S000004664    C    mitochondrion    GO:0005739    ORF|Verified
    YMR060C    SAM37    S000004664    F    protein binding    GO:0005515    ORF|Verified
    YMR060C    SAM37    S000004664    P    anatomical structure morphogenesis    GO:0009653    ORF|Verified
    YMR060C    SAM37    S000004664    P    membrane organization    GO:0016044    ORF|Verified
    YMR060C    SAM37    S000004664    P    organelle organization    GO:0006996    ORF|Verified

From the code you posted, I take it that the SGDID is the third field: S000004664

And the lines appear to have 7 fields of data but the fifth field can have spaces in the data, for example: "mitochondrial envelope".
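
One quick way to check that reading is to split a single row on tabs and count the fields. This is only a sketch: the row below is rebuilt from the sample above with explicit tab separators, on the assumption that the real file separates fields with single tab characters.

```perl
use strict;
use warnings;

# One sample row from the thread, rebuilt with explicit tabs
# (assumption: the real file uses a single tab between fields).
my $line = join "\t", 'YMR060C', 'SAM37', 'S000004664', 'C',
    'mitochondrial envelope', 'GO:0005740', 'ORF|Verified';

# Split on tabs only, so spaces inside a field are preserved.
my @fields = split /\t/, $line;
my ($orf, $gene, $sgdid, $go_aspect, $go_slim, $goid, $feature_type) = @fields;

print scalar(@fields), " fields; SGDID=$sgdid aspect=$go_aspect slim='$go_slim'\n";
```

Because the split pattern is `/\t/` and not whitespace, "mitochondrial envelope" stays together as the fifth field.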

Is what I have said correct?
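
If that reading is right, the grouping described in the question could be sketched with a hash of hashes of arrays: SGDID, then go_aspect, then the list of go_slim terms. The helper names `group_go_slim` and `joined_terms` are made up for this example, and the eight rows are the sample rows above rebuilt with tabs; in the real script the lines would come from the `@slim` array read from go_slim_mapping.tab.

```perl
use strict;
use warnings;

# Build a lookup: SGDID -> go_aspect (F, C or P) -> list of go_slim terms.
# Takes a reference to an array of chomped, tab-delimited lines.
# (Hypothetical helper, not part of the original script.)
sub group_go_slim {
    my ($lines) = @_;
    my %terms;
    for my $line (@$lines) {
        my ($orf, $gene, $sgdid, $aspect, $slim) = split /\t/, $line;
        next unless defined $slim;    # skip blank or short lines
        push @{ $terms{$sgdid}{$aspect} }, $slim;
    }
    return \%terms;
}

# Join the terms for one SGDID and aspect with "|".
# Returns an empty string when there is nothing to join.
sub joined_terms {
    my ($terms, $sgdid, $aspect) = @_;
    my $list = exists $terms->{$sgdid} ? $terms->{$sgdid}{$aspect} : undef;
    return join '|', @{ $list || [] };
}

# Sample rows for S000004664: aspect, go_slim term, GO id.
my @slim;
for my $row (
    [ 'C', 'cytoplasm',                          'GO:0005737' ],
    [ 'C', 'membrane',                           'GO:0016020' ],
    [ 'C', 'mitochondrial envelope',             'GO:0005740' ],
    [ 'C', 'mitochondrion',                      'GO:0005739' ],
    [ 'F', 'protein binding',                    'GO:0005515' ],
    [ 'P', 'anatomical structure morphogenesis', 'GO:0009653' ],
    [ 'P', 'membrane organization',              'GO:0016044' ],
    [ 'P', 'organelle organization',             'GO:0006996' ],
) {
    push @slim, join "\t", 'YMR060C', 'SAM37', 'S000004664', @$row, 'ORF|Verified';
}

my $terms = group_go_slim(\@slim);
my $id    = 'S000004664';

# Same column order as the header line in the question:
# Molecular_Function~Biological_Process~Cellular_Component
my @columns = map { joined_terms($terms, $id, $_) } qw(F P C);
print join('~', @columns), "\n";
```

For the sample rows, the C column comes out as cytoplasm|membrane|mitochondrial envelope|mitochondrion. The three joined strings could then be appended to the row printed for each ID in the final loop of the script. Terms are kept in file order here; if a particular order matters or the file repeats a term, sorting or de-duplicating each list before the join would be a small addition.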
Oct 9 '08 #2
