Compare two files in perl

I have a problem. I am trying to compare two text files that contain a large amount of data. I have written a Perl script to cross-check both files, but it takes a very long time to run. The code works fine on a small amount of data. The sample files are attached here.

I want the first line of chr.txt to be checked against every line in exon.txt. The process should repeat until every line of chr.txt has been checked against the lines of exon.txt.

This is the code I developed:

    use strict;
    use warnings;

    my $file1 = "exon.txt";
    my $file2 = "chr.txt";

    open(my $exon_fh, '<', $file1) or die "couldn't open $file1: $!";
    open(my $chr_fh,  '<', $file2) or die "couldn't open $file2: $!";
    open(my $out_fh,  '>', "result.txt") or die "couldn't open result.txt: $!";

    # Read both files into memory.
    my @arr1 = <$exon_fh>;
    my @arr2 = <$chr_fh>;

    # Compare every exon line with every chr line (nested loops).
    foreach my $arr1 (@arr1) {
        chomp $arr1;
        my ($eChr, $eStart, $eEnd, $eCat) = split(/\t/, $arr1);

        foreach my $arr2 (@arr2) {
            chomp $arr2;
            my ($cChr, $cStart, $cEnd) = split(/\t/, $arr2);
            # Report chr ranges that lie entirely within the exon range.
            if (($cChr eq $eChr) && ($cStart >= $eStart) && ($cEnd <= $eEnd)) {
                print $out_fh "$cChr\t$cStart\t$cEnd\t$eCat\t$eStart\t$eEnd\n";
            }
        }
    }
    close($exon_fh);
    close($chr_fh);
    close($out_fh);
Attached Files
File Type: txt chr.txt (75 Bytes, 756 views)
File Type: txt exon.txt (76 Bytes, 649 views)
Feb 7 '14 #1
RonB
589 Expert Mod 512MB
You're looping over the data too many times.

Load the first file (exon.txt) into a HoAoH (Hash of Array of Hashes) where the key is the "chr" and the hash ref would hold the rest of the data. Then loop over the chr.txt file line-by-line checking for the existence of the "chr" key.

The sample data you posted won't produce any matching results, but presumably your real data set will.

  1. #!/usr/bin/perl
  2.  
  3. use strict;
  4. use warnings;
  5.  
  6. my $file1 = "exon.txt";
  7. open my $exon_fh, '<', $file1 or die "couldn't open $file1 $!";
  8.  
  9. my %exon;
  10. while (my $line = <$exon_fh>) {
  11.     next if $line =~ /^\s*$/;
  12.     chomp $line;
  13.     my ($chr,$start,$end,$cat) = split(/\t/, $line);
  14.     push @{$exon{$chr}}, {
  15.         start => $start,
  16.         end   => $end,
  17.         cat   => $cat,
  18.     };
  19. }
  20. close $exon_fh;
  21.  
  22. my $file2 = "chr.txt";
  23. open my $chr_fh, '<', $file2 or die "couldn't open $file2 $!";
  24.  
  25. while (my $line = <$chr_fh>) {
  26.     next if $line =~ /^\s*$/;
  27.     chomp $line;
  28.     my ($chr,$start,$end) = split(/\t/, $line);
  29.     next unless exists $exon{$chr};
  30.  
  31.     foreach my $exon ( $exon{$chr} ) {
  32.         if ($start >= $exon{start} && $end <= $exon{end} ) {
  33.             print join("\t", $chr, $start, $exon{start}, $end <= $exon{end}) . "\n";
  34.         }
  35.     }
  36. }
  37. close $chr_fh;
  38.  
Feb 7 '14 #2
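To make the HoAoH described in the post above concrete, here is a minimal sketch of the shape that the loading loop builds and how its elements can be reached. The chromosome names, coordinates, and category labels are made up for illustration (the attached sample files are not reproduced in this thread), and `build_exon_index` is a hypothetical helper name, not something from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: turn tab-separated "chr start end cat" lines
# into a hash of arrays of hashes, keyed by chromosome name.
sub build_exon_index {
    my @lines = @_;
    my %exon;
    for my $line (@lines) {
        next if $line =~ /^\s*$/;    # skip blank lines
        chomp $line;
        my ($chr, $start, $end, $cat) = split /\t/, $line;
        push @{ $exon{$chr} }, { start => $start, end => $end, cat => $cat };
    }
    return \%exon;
}

# Made-up sample rows, not the poster's data.
my $exon = build_exon_index(
    "chr1\t100\t200\texonA\n",
    "chr1\t500\t900\texonB\n",
    "chr2\t50\t75\texonC\n",
);

# Each key maps to an array reference; each element is a hash reference.
my $count_chr1 = scalar @{ $exon->{chr1} };
my $first      = $exon->{chr1}[0];
printf "chr1 has %d exons; first is %s (%d-%d)\n",
    $count_chr1, $first->{cat}, $first->{start}, $first->{end};
```

Grouping the exon rows by chromosome means a chr.txt line only has to be compared with the exons on its own chromosome rather than with every line of exon.txt.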
RonB
589 Expert Mod 512MB
I just noticed that I had an error in the print statement. It should be:
    print join("\t", $chr, $start, $exon{cat}, $exon{start}, $exon{end}) . "\n";
Feb 7 '14 #3
raj14
4
Thanks for the help, RonB. But when I run this script, it reports an error: "Use of uninitialized value".

Can you explain this part?

    push @{$exon{$chr}}, ;{
        start&nbsp;=> $start,
        end&nbsp;  => $end,
        cat&nbsp;  => $cat,
    };
Feb 17 '14 #4
RonB
589 Expert Mod 512MB
Which part do you want explained? The syntax errors that you added to the code I gave you or what the code should do without your syntax errors?
Feb 17 '14 #5
raj14
4
@RonB

The error is "Use of uninitialized value in numeric ge (>=)". So I guess the syntax has some problem. This part of your code has the error. I have attached it here.
    next if $line =~ /^\s*$/;
    chomp $line;
    my ($chr,$start,$end,$cat) = split(/\t/, $line);
    push @{$exon{$chr}}, ;{
        start => $start,
        end  => $end,
        cat  => $cat,
    };
Feb 18 '14 #6
RonB
589 Expert Mod 512MB
Remove the semicolon in this line:
    push @{$exon{$chr}}, ;{
Feb 18 '14 #7
raj14
4
It still produces the same error.
Feb 18 '14 #8
RonB
589 Expert Mod 512MB
You didn't say which line the warning message was referring to.

The only line in the code I gave that does that numerical test is this one (line 32):
    if ($start >= $exon{start} && $end <= $exon{end} ) {
You need to dump those 4 vars (via the Data::Dumper module) to see which one is undefined.
Feb 18 '14 #9
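One likely source of that warning, judging only from the code quoted in this thread: `foreach my $exon ( $exon{$chr} )` loops once over the array reference itself, and inside the loop `$exon{start}` reads the key `start` from the `%exon` hash rather than from the loop variable `$exon`. Unless the data contains a chromosome literally named `start`, that value is undefined. The sketch below shows one way to write the lookup with the array dereferenced and the fields read through `->`. The sample rows and the helper name `find_containing` are invented for illustration, the output columns follow the original poster's format, and the `Data::Dumper` call is only there to inspect the structure.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump

# Made-up index in the same shape as %exon in the thread.
my %exon = (
    chr1 => [
        { start => 100, end => 200, cat => 'exonA' },
        { start => 500, end => 900, cat => 'exonB' },
    ],
);

# Hypothetical helper: return one tab-joined row for every exon on $chr
# whose range fully contains [$start, $end].
sub find_containing {
    my ($index, $chr, $start, $end) = @_;
    return unless exists $index->{$chr};

    my @rows;
    foreach my $e ( @{ $index->{$chr} } ) {    # dereference the array
        # $e is a hash reference, so its fields are read with ->
        if ( $start >= $e->{start} && $end <= $e->{end} ) {
            push @rows, join("\t", $chr, $start, $end,
                             $e->{cat}, $e->{start}, $e->{end});
        }
    }
    return @rows;
}

print Dumper(\%exon);    # inspect what was loaded
print "$_\n" for find_containing(\%exon, 'chr1', 150, 180);
```

If the warning persists after a change like this, dumping the parsed fields of each input line the same way should show whether a line has fewer tab-separated columns than expected.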
