By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
437,775 Members | 1,740 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 437,775 IT Pros & Developers. It's quick & easy.

Perl Scripting Newbie

P: 3
Hello. I have a large number of enormous text files (invoices) and I need a way to extract just the 8 digit invoice numbers from the rest of the text in the file. In excel, I developed a macro to delete all entries that were less than 89999999 (invoices are 8 characters and start with the number 9), remove duplicate entries, and sort. The macro works on a small scale but Excel absolutely crashes when dealing with large text files and this macro. A coworker (I am merely an intern) suggested that I try and use perl but I am running into some problems.

I just some direction as to how to create a script that will export all of the 8 digit numbers starting with the number 9 (preferably) and export it to a new text file. Ideally the script would be smart enough to name the new text file in a similar manner to the original. Eg: Invoices_01-01-2004-Extract. I have enclosed the original VB Excel code, although it may not do much good.

Thanks!

Expand|Select|Wrap|Line Numbers
  1. Sub TopKill()
  2.    ' Turn off screen updating to speed up macro.
  3. Application.ScreenUpdating = False
  4.  
  5.     Application.ScreenUpdating = False
  6.     Application.Calculation = xlCalculationManual   'pre XL97 xlManual
  7.     Dim rng As Range, i As Long   '// modified
  8.  
  9.     'Set the range to evaluate to rng.  // modified
  10.     Set rng = Intersect(Selection, ActiveSheet.UsedRange)
  11.     If rng Is Nothing Then
  12.        MsgBox "nothing in Intersected range to be checked"
  13.        GoTo done
  14.     End If
  15.  
  16.     'Loop backwards through the rows
  17.     'in the range that you want to evaluate.
  18.     '--- For i = rng.Rows.Count To 1 Step -1  // modified
  19.  
  20.     For i = rng.Count To 1 Step -1
  21.  
  22.         'This will kill all values that are non-invoice numbers
  23.         If rng.Cells(i).Value < 89999999 Then rng.Cells(i).EntireRow.Delete
  24.     Next
  25. done:
  26.     Application.Calculation = xlCalculationAutomatic
  27.     Application.ScreenUpdating = True
  28.  
  29. ' Search and destroy duplicates.
  30. iListCount = Sheets("Sheet1").Range("A1:A100").Rows.Count
  31. Sheets("Sheet1").Range("A1").Select
  32. ' Loop until end of records.
  33. Do Until ActiveCell = ""
  34.    ' Loop through records.
  35.    For iCtr = 1 To iListCount
  36.       ' Don't compare against yourself.
  37.       ' To specify a different column, change 1 to the column number.
  38.       If ActiveCell.Row <> Sheets("Sheet1").Cells(iCtr, 1).Row Then
  39.          ' Do comparison of next record.
  40.          If ActiveCell.Value = Sheets("Sheet1").Cells(iCtr, 1).Value Then
  41.             ' If match is true then delete row.
  42.             Sheets("Sheet1").Cells(iCtr, 1).Delete xlShiftUp
  43.                ' Increment counter to account for deleted row.
  44.                iCtr = iCtr + 1
  45.          End If
  46.       End If
  47.    Next iCtr
  48.    ' Go to next record.
  49.    ActiveCell.Offset(1, 0).Select
  50. Loop
  51.  
  52. ' Sort our data
  53.     Columns("A:A").Select
  54.     ActiveWorkbook.Worksheets("Sheet1").Sort.SortFields.Clear
  55.     ActiveWorkbook.Worksheets("Sheet1").Sort.SortFields.Add Key:=Range("A1:A25") _
  56.         , SortOn:=xlSortOnValues, Order:=xlAscending, DataOption:=xlSortNormal
  57.     With ActiveWorkbook.Worksheets("Sheet1").Sort
  58.         .SetRange Range("A1:A25")
  59.         .Header = xlGuess
  60.         .MatchCase = False
  61.         .Orientation = xlTopToBottom
  62.         .SortMethod = xlPinYin
  63.         .Apply
  64.  
  65.     End with
  66. MsgBox "Done!"
  67.  
  68. End Sub
Jul 8 '10 #1

✓ answered by toolic

Perl is a good choice for this task. Since you didn't show an example of your input file, I had to make several assumptions. The follwoing code will print 8-digit numbers which begin with 9, even if there are several of these numbers on a single line. All arguments passed on the command line are in the special @ARGV array variable.

Expand|Select|Wrap|Line Numbers
  1. use warnings;
  2. use strict;
  3.  
  4. die "Usage: extract.pl input.txt output.txt\n" unless @ARGV == 2;
  5. my ($infile, $outfile) = @ARGV;
  6.  
  7. open my $fh_in , '<', $infile  or die "can not open $infile: $!";
  8. open my $fh_out, '>', $outfile or die "can not open $outfile: $!";
  9. while (<$fh_in>) {
  10.     while (/(9[0-9]{7})/g) {
  11.         print $fh_out "$1\n";
  12.     }
  13. }
  14. close $fh_in;
  15. close $fh_out;
If the input file contains the following lines:

Expand|Select|Wrap|Line Numbers
  1. foo
  2. bar 91234567 goo
  3. 222 7777 12345678 zzz
  4. 99999999 9999 91111111
The the output fill will contain:

Expand|Select|Wrap|Line Numbers
  1. 91234567
  2. 99999999
  3. 91111111
There are some great tutorials on the official Perl documentation website
http://perldoc.perl.org

Specifically:
perlintro

Share this Question
Share on Google+
3 Replies


P: 3
Update: I got this little ditty to actually carry out the operation of isolating the invoice numbers. However, it looks for the pattern to start exactly 28 characters in, I haven't been able to tell it to find ALL 8 digit numbers starting with 9.
Expand|Select|Wrap|Line Numbers
  1. open(IN,'dummy.txt') or die("cannot open the log file for reading\n");
  2. open(OUT,'>output.txt') or die("cannot open the output file for writing\n");######To create the output file
  3. while(<IN>){######to read the input file
  4.     print OUT "$_" if($_=~ /^.{28}9[0-9]{7}/);    
  5. }
  6. close(IN);
  7. close(OUT);
  8.  
Another thing I would like to do is to be able to choose the files I want from the CMD line. Eg: perl extract.pl input.txt output.txt but whenever I do that, I get errors about the input file not being able to be opened.
[code]use strict;

Expand|Select|Wrap|Line Numbers
  1.  
  2. while(<>){
  3. $infile=$1;
  4. $outfile=$2;
  5. (IN,$infile) or die("cannot open the input file for reading\n");
  6. (OUT,$outfile) or die("cannot open the output file for writing\n");######To create the output file
  7. while(<IN>){######to read the input file
  8.     print "$_" if($_=~ /^.{28}(9[0-9]{7})/);
  9. }}
  10.  
  11.  
Does anyone have any suggestions?
Jul 9 '10 #2

Expert
P: 70
Perl is a good choice for this task. Since you didn't show an example of your input file, I had to make several assumptions. The follwoing code will print 8-digit numbers which begin with 9, even if there are several of these numbers on a single line. All arguments passed on the command line are in the special @ARGV array variable.

Expand|Select|Wrap|Line Numbers
  1. use warnings;
  2. use strict;
  3.  
  4. die "Usage: extract.pl input.txt output.txt\n" unless @ARGV == 2;
  5. my ($infile, $outfile) = @ARGV;
  6.  
  7. open my $fh_in , '<', $infile  or die "can not open $infile: $!";
  8. open my $fh_out, '>', $outfile or die "can not open $outfile: $!";
  9. while (<$fh_in>) {
  10.     while (/(9[0-9]{7})/g) {
  11.         print $fh_out "$1\n";
  12.     }
  13. }
  14. close $fh_in;
  15. close $fh_out;
If the input file contains the following lines:

Expand|Select|Wrap|Line Numbers
  1. foo
  2. bar 91234567 goo
  3. 222 7777 12345678 zzz
  4. 99999999 9999 91111111
The the output fill will contain:

Expand|Select|Wrap|Line Numbers
  1. 91234567
  2. 99999999
  3. 91111111
There are some great tutorials on the official Perl documentation website
http://perldoc.perl.org

Specifically:
perlintro
Jul 10 '10 #3

P: 3
Thank you! I managed to write a script to extract the data but this is much cleaner and more efficient!
Jul 13 '10 #4

Post your reply

Sign in to post your reply or Sign up for a free account.