Bytes IT Community

perl regex

P: 89
I have a data file whose 4th column looks like the examples below:

34899939-34899967
34899939-34899967:34905554-34905559
34899939-34899967:34905554-34905559:34905560-34905574


I have to extract like below:
For the first line:
$start = 34899939
$end = 34899967
$block_size = 1

For the 2nd line:
$start = 34899939
$end = 34905559
$block_size = 2
$n1=34899939
$n2=34899967
$n3=34905554
$n4=34905559

For the 3rd line:
$start = 34899939
$end = 34905574
$block_size = 3
$n1=34899939
$n2=34899967
$n3=34905554
$n4=34905559
$n5=34905560
$n6=34905574

I can tell one block from two blocks by checking for the : character, and I have a solution for the 2nd line, shown below:
sub special {
        chomp $_;
        my @v = split(/\s+/,$_);
        if($v[3] =~ /\:/) {
                $num1 = $`;
                $num2 = $';
                if($num1 =~ /\-/) {
                        $n1 = $`;
                        $n2 = $';
                }
                if($num2 =~ /\-/) {
                        $n3 = $`;
                        $n4 = $';
                }
        }
        $start = $n1;
        $end = $n4;
        print "$n1 \t $n2 \t $n3 \t $n4 \n";
}

But how do I generalise this to any number of : separators and generate the $n(i) values? Thanks.
Feb 19 '09 #1
4 Replies


KevinADC
Expert 2.5K+
P: 4,059
Your data is confusing. What do you mean by "the 4th column looks like this"? You have posted three separate lines of data. Are they part of a larger line of data? Why does your sub special() first split on spaces when there are no spaces in the data you posted?
Feb 19 '09 #2

P: 89
Hi Kevin, sorry for the confusion. Let me try to explain.

My 4th column can contain data like the examples in my previous post. It can hold a single pair of numbers with no :, or two, three or more pairs separated by :

So the 4th column of each line can hold different kinds of data.

I parse every line to get the 4th column, which is why I split on whitespace first. I then split that column on the : character and split each piece further.

If my 4th column is like the 1st example, it is easy for me to split the numbers and put them into $n1 and $n2.

If my 4th column is like the 2nd example, with one :, then my subroutine special() does the job and assigns $n1, $n2, $n3 and $n4.

But I want to write a generalised routine that can handle a line like the 3rd example, or one with even more : separators.

I hope I have explained clearly.
Regards
Feb 19 '09 #3

KevinADC
Expert 2.5K+
P: 4,059
That helped clear it up. I hope I am not doing your school work for you.

while(<DATA>) {
   special($_);
}

sub special {
   local ($_) = @_;
   chomp $_;
   # 4th whitespace-separated column of the line
   my $col4 = (split(/\s+/))[3];
   # each block is a "number-number" pair; blocks are separated by ":"
   my @blocks = split(/:/,$col4);
   # flatten every block into one list of numbers
   my @temp;
   for (@blocks) {
      push @temp, split(/-/);
   }
   print "start = $temp[0]\n";
   print "end = $temp[-1]\n";
   print 'blocks = ', scalar @blocks,"\n";
   for (@temp) {
      print "\t$_\n";
   }
   print "\n";
}
__DATA__
dummy dummy dummy 34899939-34899967 dummy
dummy dummy dummy 34899939-34899967:34905554-34905559 dummy
dummy dummy dummy 34899939-34899967:34905554-34905559:34905560-34905574 dummy

Substitute your own file I/O in place of DATA.
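
If you would rather get the values back than print them, the same split logic can be wrapped in a function that returns a hash reference. This is only a sketch: the name parse_blocks and the hash keys are made up for illustration, and it assumes every block is strictly "digits-digits", dying on anything else.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): parse one column value such as
# "34899939-34899967:34905554-34905559" and return a hash reference.
sub parse_blocks {
    my ($col) = @_;

    # Blocks are separated by ":"; each block must look like "digits-digits".
    my @blocks = split /:/, $col;
    my @numbers;
    for my $block (@blocks) {
        my ($from, $to) = $block =~ /^(\d+)-(\d+)$/
            or die "Malformed block '$block' in '$col'\n";
        push @numbers, $from, $to;
    }
    die "No blocks found in '$col'\n" unless @numbers;

    return {
        start      => $numbers[0],     # first number of the first block
        end        => $numbers[-1],    # last number of the last block
        block_size => scalar @blocks,  # how many ":"-separated blocks
        numbers    => \@numbers,       # n1, n2, ... as a flat list
    };
}

my $r = parse_blocks('34899939-34899967:34905554-34905559:34905560-34905574');
print "start=$r->{start} end=$r->{end} block_size=$r->{block_size}\n";
# prints: start=34899939 end=34905574 block_size=3
```

The flat list in $r->{numbers} takes the place of $n1, $n2, ... so the code does not need to know in advance how many : separators a line has.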
Feb 19 '09 #4

KevinADC
Expert 2.5K+
P: 4,059
output is:

start = 34899939
end = 34899967
blocks = 1
    34899939
    34899967

start = 34899939
end = 34905559
blocks = 2
    34899939
    34899967
    34905554
    34905559

start = 34899939
end = 34905574
blocks = 3
    34899939
    34899967
    34905554
    34905559
    34905560
    34905574

Edit the output to display it however you need. I added "start", "end" and "blocks" just to make it easier to read.
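
For example, to label the numbers the way the original question does ($n1=..., $n2=...), you can number them by array index instead of creating separate variables. A minimal sketch, assuming the numbers are already in a flat list like @temp above; format_numbered is a made-up name:

```perl
use strict;
use warnings;

# Hypothetical helper: turn a flat list of numbers into "$n1=...", "$n2=..."
# strings. The format string is single-quoted so "$n" stays literal text.
sub format_numbered {
    my @numbers = @_;
    return map { sprintf '$n%d=%s', $_ + 1, $numbers[$_] } 0 .. $#numbers;
}

my @temp = (34899939, 34899967, 34905554, 34905559);
print "$_\n" for format_numbered(@temp);
# prints:
# $n1=34899939
# $n2=34899967
# $n3=34905554
# $n4=34905559
```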
Feb 19 '09 #5
