Thanks for your help.
My script now looks like this:
#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my %count;
while (<INFILE>) {
    s/^(\S+\s+){2}//;   # strip the two identifier columns
    $count{$_}++;       # count what is left of the line
}
print "$count{$_} $_" for keys %count;
__END__
So I'm feeding the file into the %count hash by removing the first two
columns with the identifier information and then counting the keys.
How can I still keep the identifier part of the line linked to the hash?
This is the part which I'm really interested in.
I can't keep the identifier in
the %count hash, since this would screw up the "for keys" part.
I checked perldoc -q and found how to remove duplicates but I don't think
I can rewrite this to do what I want.
The "for keys" method is brilliant but I'm losing the identifier.
So I'm back to my original script which looks like this.
#!/usr/bin/perl
# Perl script to find most common CS
use strict;
use warnings;
my $infile = "/home/martin/DATABASE/large.txt";
open INFILE, $infile or die "Shit! Couldn't open file $infile: $!\n";
my @array = <INFILE>;
print "There are ", $#array+1, " lines in the large array\n";
my @table;
foreach my $line (@array) {
    push @table, [ split ' ', $line ];
}
for (my $k = 0; $k <= $#array; $k++) {
    print "$table[$k][1] $table[$k][2] occurs ";
    my $matched = 0;
    # compare line $k against every line in the file
    for (my $h = 0; $h <= $#array; $h++) {
        my $match = 0;
        # columns 2..11 hold the ten numbers of the sequence
        for (my $j = 2; $j <= 11; $j++) {
            if ($table[$k][$j] == $table[$h][$j]) {
                $match++;
            }
        }
        if ($match == 10) {
            $matched++;
        }
    }
    print "$matched times\n";
} # end of large loop
But this sad-looking script is not very smart and it is very slow. I don't
want to run over every line again for each line. I would like the script to
search the file and identify a sequence as unique. If there are duplicate
sequences in that file then print out how many, and do not revisit a line
if it has already been counted as a duplicate.
My data file looks like this (a small section only):
810 141-2_1_2 4 10 21 37 58 83 111 145 184 226
811 141-2_1_6 4 12 24 42 64 92 124 162 204 252
812 141-2_1_7 4 11 23 44 67 95 134 168 215 271
879 141_1_2 4 10 21 37 58 83 111 145 184 226
880 141_1_6 4 12 24 42 64 92 124 162 204 252
881 141_1_7 4 11 23 44 67 95 134 168 215 271
882 152_1_15 4 12 26 44 72 104 138 178 228 282
883 152_1_23 4 10 21 40 65 96 134 180 230 286
884 152_1_24 4 10 21 40 65 96 134 180 230 286
885 152_1_3 4 12 22 40 66 102 128 168 218 268
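One way to keep the identifiers attached to each sequence is a hash of arrays: key the hash on the ten number columns and push each line's identifier onto the list stored under that key. What follows is only a minimal sketch, assuming the first two whitespace-separated columns are the identifier, as in the sample above. The sub name group_by_sequence and the inline @sample lines are hypothetical, chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: group identifiers by the sequence that follows them.
# Assumes each line looks like "<number> <name> <sequence columns...>".
sub group_by_sequence {
    my @lines = @_;
    my %ids_for;    # sequence string => array ref of identifiers
    for my $line (@lines) {
        my ($num, $name, @seq) = split ' ', $line;
        next unless @seq;    # skip blank or too-short lines
        push @{ $ids_for{"@seq"} }, "$num $name";
    }
    return \%ids_for;
}

# A few lines taken from the sample data above.
my @sample = (
    "810 141-2_1_2 4 10 21 37 58 83 111 145 184 226",
    "879 141_1_2 4 10 21 37 58 83 111 145 184 226",
    "882 152_1_15 4 12 26 44 72 104 138 178 228 282",
);

my $groups = group_by_sequence(@sample);
for my $seq (sort keys %$groups) {
    my @ids = @{ $groups->{$seq} };
    print scalar(@ids), " x $seq: ", join(", ", @ids), "\n";
}
```

Each line is looked at only once, so no line is revisited after it has been counted. To run this on the real file, the lines read from the filehandle would be passed in place of @sample.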
Again many thanks for your help. I still don't get why you say
this newsgroup has been deleted. What is the url for the replacement
newsgroup?
no****@mail.com wrote in message news:<4d**************************@posting.google.com>...
md********@netscape.net (Martin Foster) wrote:
no****@mail.com wrote:
I shall assume that you really want to count the number of times each
distinct line appears in a file.
perl -ne '$c{$_}++; END { print "$c{$_} $_" for keys %c }'
Or as a script:
$count{$_}++ while <>;
This is amazing, I don't understand how it works but it's very
powerful.
If you look in the newsgroup that replaced this one when this one was
deleted, you'll find every couple of months someone posts a script
substantially like the one above and says "I found this - how does it
work?".
You could look at one of those threads.
I believe it is also an example that is used in most Perl tutorials.
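For anyone puzzled by the one-liner, here is a rough long-hand equivalent of the counting step. It is only a sketch: the sub name count_lines is hypothetical, and it takes a list of lines instead of reading from <> so that it can be run on its own.

```perl
use strict;
use warnings;

# Hypothetical long-hand version of: $count{$_}++ while <>;
sub count_lines {
    my %count;
    for my $line (@_) {
        # The whole line is the hash key. A key seen for the first
        # time has no value yet, and ++ treats that as 0.
        $count{$line}++;
    }
    return %count;
}

my %count = count_lines("a\n", "b\n", "a\n");
print "$count{$_} $_" for sort keys %count;
```

Because identical lines are identical keys, the hash ends up with one entry per distinct line and the value is how often that line was seen.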
Can I use this script to compare n columns of a file, not the entire
file?
No you can't use this _script_. But you can use the technique.
Rather than keying %count on the whole line you can use some sort of
string manipulation to extract just part of the line to consider. The
most normal way to manipulate strings in Perl is the m// and s///
operators.
I've got an identifier for each line at the beginning, for example
1666237 4 10 23 16 and so on. The identifier is an id to link to
something else and so on. I just want to compare the 10 columns with
the numbers.
Well if, for example, we say the first 3 whitespace-delimited columns
are the identifier you could remove them thus:
s/^(\S+\s+){3}// and $count{$_}++ while <>;
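Spelled out, that statement does two things per line: the substitution deletes the leading identifier columns, and only if it succeeded is the remainder counted. A sketch of the same logic follows. The sub name count_after_prefix and its $n_cols parameter are hypothetical names introduced here, not part of the original suggestion.

```perl
use strict;
use warnings;

# Hypothetical helper: count lines after dropping the first $n_cols
# whitespace-delimited columns.
sub count_after_prefix {
    my ($n_cols, @lines) = @_;
    my %count;
    for my $line (@lines) {
        my $rest = $line;
        # s/// returns false when the line has too few columns;
        # such lines are skipped, just as "and" skips them above.
        next unless $rest =~ s/^(?:\S+\s+){$n_cols}//;
        $count{$rest}++;
    }
    return %count;
}

my %count = count_after_prefix(2,
    "883 152_1_23 4 10 21 40 65 96 134 180 230 286\n",
    "884 152_1_24 4 10 21 40 65 96 134 180 230 286\n",
);
print "$count{$_} $_" for sort keys %count;
```

Working on a copy ($rest) leaves the original line, identifier included, available if it is needed later.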
I also suggest you post to newsgroups that still exist (this one
doesn't, see FAQ). Your post will then be seen by many more people.
BTW where is the FAQ, which says this newsgroup no longer exists?
The Perl FAQ is part of the standard Perl documentation that can be
found on any computer on which Perl has been installed and also on
various Perl-related web sites.