Bytes IT Community

Newbie...parsing from multiple lines.

P: 3
I'm trying to get my script to parse a bunch of files and grab the data between the <title></title> and <blah></blah> tags. Yes, yes, I'm parsing HTML with regex, but it works. :)

The issue is that sometimes there is one line, and sometimes 30 lines, between <title> and <blah>, so I can't just .+ it all the way. Plus, there are multiple <blah> tags in each file. I'm looking for a way to scan the file for <title>, assign it to $1, then search for every instance of <blah> and assign each one to $2 and upwards as necessary, then print to the tab file as $1 \t $2 \t $3 etc. Boy, I hope that gibberish made sense, lol. I'm new, so an explanation full of hardcore jargon might not be good for me. Here's what I have so far:

    #!/usr/bin/env perl
    #fix.py

    $dir = 'e:\\tmp';
    $outdir = "newfiles";
    $tabfile = "tabdata.txt";

    ### EDIT CAREFULLY BELOW HERE :) ###
    open(TAB, ">$dir\\$outdir\\$tabfile");
    print TAB ("Item Name\tItem Number\tCost\tAdd\tIn All\n");
    open(PARTNUMBER, "$dir\\$outdir\\partnumber.txt");
    while (<PARTNUMBER>) {
        chomp;
        $i = $_;
    }
    close(PARTNUMBER);
    print "Opening $dir\n";
    opendir(DH,$dir);
    while (defined ( my $filename = readdir(DH))) {
        if ($filename =~ m/\.htm/ ) {
            $outfilename=">$dir\\$outdir\\$filename";
            print "Opening $filename\n";
            open(FHI,$filename);
            while (<FHI>) {
                $html .= $_;
            }
            close(FHI);
            while ($html =~ s/<title>(.+?)<\/title>/$1$2$3$4/)
            {
                print TAB ("$1\t$2\t$3\t$i\n");
                open (PARTNUMBER, ">$dir\\$outdir\\partnumber.txt");
                print PARTNUMBER ($i);
                close(PARTNUMBER);
                print "$i matches found in $filename\n";
                print "Saving to $outfilename\n";
                open(FHO, $outfilename);
                print FHO ($html);
                close(FHO);
            }
        }
        $html = '';
    }
    print "Done\n";

Thanks in advance!
Mar 25 '08 #1
2 Replies


KevinADC
Expert 2.5K+
P: 4,059
Some sample input and sample output would probably help.
Mar 26 '08 #2

eWish
Expert 100+
P: 971
Is this the line you are using to capture the data between the title tags?
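    while ($html =~ s/<title>(.+?)<\/title>/$1$2$3$4/)
The reason I ask is that s/// is the substitution operator: it rewrites $html instead of just capturing from it. The pattern also has only one set of parentheses, so only $1 is ever set; $2, $3 and $4 stay undefined.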
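
For comparison, here is a rough sketch of capturing with the match operator m// instead. It is not checked against your real files, since no sample input has been posted, so the sample HTML, the sub name extract_fields, and the whitespace cleanup are all my assumptions. The /s modifier lets . match across newlines, which covers the "one line or 30 lines" case, and /g in list context returns every <blah> capture at once, so there is no need to number them $2, $3 and so on.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample input; the real files were not posted.
my $html = <<'END';
<html><head><title>Widget
Deluxe</title></head>
<body>
<blah>A-100</blah>
<p>filler</p>
<blah>
9.95
</blah>
</body></html>
END

# Return the title followed by every <blah> value found in the text.
# (extract_fields is a made-up name for this sketch.)
sub extract_fields {
    my ($text) = @_;

    # m// only matches; /s lets . cross newlines, /i ignores tag case.
    my ($title) = $text =~ m{<title>(.*?)</title>}si;

    # /g in list context returns every capture, however many there are.
    my @blah = $text =~ m{<blah>(.*?)</blah>}gsi;

    # Collapse runs of whitespace (including newlines) and trim the ends.
    for my $field ($title, @blah) {
        next unless defined $field;
        $field =~ s/\s+/ /g;
        $field =~ s/^ | $//g;
    }
    return ($title, @blah);
}

# One tab-separated row: title first, then each <blah> value.
my @fields = extract_fields($html);
print join("\t", map { defined $_ ? $_ : '' } @fields), "\n";
```

In the original script, the print at the end would become print TAB instead, and $html would still be filled by reading the whole file in before matching.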

--Kevin
Mar 26 '08 #3
