Need help with some very Practical Extraction

Newbie
 
Join Date: Jan 2007
Posts: 5
#1: Jan 28 '07
Hi,

What I'm trying to do seems right up Perl's alley, but I can't get it to work. I'm using the WWW::Mechanize module to retrieve a sprawling HTML document from which I want to extract certain strings and save them. I can get this much to work:
    use WWW::Mechanize;
    $url = "http://someurl";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    my @array_of_data = $mechanize->content;
but now I am stuck on how to process that data.

The HTML doc is quite long and contains numbers that I want to extract. Each number is always preceded by the same text string, such as:

<a href bla bla bla>bla bla bla<random tag>mydigits=493409834%bla bla bla<meaningless tag>bla bla</a>

where the string "mydigits=" always precedes the desired number and is usually all lowercase but can occasionally look like "MyDigits="; where the number itself may be anywhere from one to 10 digits in length; and where "%" might literally be "%" or any other non-digit character, including a space. Moreover, the desired string might appear more than once per line -- assuming Perl doesn't see the HTML doc as just one single long line of text anyway.

What I have tried is many extremely ugly variations on
    my $pattern = "[Mm]y[Dd]igits=[0-9]*[^0-9]";
    foreach (@array_of_data){
        if ( /$pattern/ ){
            print "$_\n";
        }
    }
but if I don't get an error, all I get is a spew-out of the entire HTML doc instead of what I am hoping for, which would be a printout or file that looks like:

219824
2230239084
04598
98739874
etc., etc.

or better yet, assign the output to an array that looks like:

@desired_array = ( 219824, 2230239084, 04598, 98739874);

I know I must be missing something very fundamental, so if anyone can help steer me away from the major mistakes I'm making, I'd appreciate it. Thanks.

KevinADC
Expert
 
Join Date: Jan 2007
Location: Southern California USA
Posts: 4,091
#2: Jan 28 '07

As long as the pattern is on the same line, this should work:


    my @digits = ();
    foreach (@array_of_data){
        if ( /mydigits=(\d+)/i ){
           print "found $1 in this line: $_\n";
           push @digits,$1;
        }
    }
    print "$_\n" for @digits;

This can be adapted if the pattern is broken over multiple lines.
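
The loop above records only the first match in each element, while the question mentions that the string may appear more than once per line. A minimal sketch of one way to collect every occurrence, assuming hypothetical sample data in place of the real page: use the /g modifier in a while loop so the match resumes where the previous one ended.

```perl
use strict;
use warnings;

# Hypothetical sample input: two "lines", the first containing two matches.
my @array_of_data = (
    '<a href="x">foo<b>mydigits=219824%bar MyDigits=04598 baz</a>',
    '<p>mydigits=98739874&more</p>',
);

my @digits;
foreach my $line (@array_of_data) {
    # In scalar context, /g continues from the end of the previous match,
    # so the loop body runs once per occurrence in $line.
    while ( $line =~ /mydigits=(\d+)/ig ) {
        push @digits, $1;
    }
}

print "$_\n" for @digits;
```

With this sample data the loop collects three numbers, two from the first element and one from the second.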
Newbie
 
Join Date: Jan 2007
Posts: 5
#3: Jan 28 '07

Wow, thank you, that is very helpful. Love the i modifier for case insensitivity!

As I suspected, the HTML doc looks like one long line to Perl, so what happens is it finds the first instance, say, 123456789, prints

"found 123456789 in this line:"

followed by what to you and me looks like more than 3000 lines of HTML, then prints:

"123456789"

and then quits. But that is more than I could get it to do before, and this definitely has me pointed in the right direction, so thanks again. :)
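
A likely reason for the "one long line" behaviour is that the content method returns the whole page as a single string, so assigning it to an array produces an array with exactly one element. The sketch below illustrates this without any network access; the $content string is a made-up stand-in for the real page.

```perl
use strict;
use warnings;

# Hypothetical stand-in for what $mechanize->content returns:
# one scalar holding the whole page, newlines included.
my $content = "<html>\n<p>mydigits=123456789%</p>\n<p>MyDigits=42 </p>\n</html>\n";

# Assigning a scalar to an array yields a single element,
# so a foreach over it runs only once.
my @array_of_data = ($content);
print scalar(@array_of_data), "\n";    # prints 1

# Splitting on newlines gives one element per line of HTML.
my @lines = split /\n/, $content;
print scalar(@lines), "\n";            # prints 4
```

Splitting into lines is only needed if the line containing each match matters; to pull out the numbers alone, matching against the single string is enough.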
Newbie
 
Join Date: Jan 2007
Posts: 5
#4: Jan 28 '07

Wow, I just realized what you did with the parentheses and the $1 to extract only the digits. Awesome!
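
For anyone else reading, here is a small sketch of what the parentheses do, using a made-up sample string: the whole pattern has to match, but only the part inside the parentheses is captured.

```perl
use strict;
use warnings;

# Hypothetical sample text.
my $text = 'bla MyDigits=493409834% bla';

# After a successful match, $& holds everything the pattern matched
# and $1 holds only what the first pair of parentheses matched.
if ( $text =~ /mydigits=(\d+)/i ) {
    print "whole match: $&\n";    # MyDigits=493409834
    print "captured:    $1\n";    # 493409834
}

# In list context the match returns the captured groups directly,
# which avoids having to read $1 afterwards.
my ($number) = $text =~ /mydigits=(\d+)/i;
print "$number\n";
```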
KevinADC
Expert
 
Join Date: Jan 2007
Location: Southern California USA
Posts: 4,091
#5: Jan 28 '07

See how this works:

    use strict;
    use warnings;
    use WWW::Mechanize;

    my $url = "http://someurl";
    my $mechanize = WWW::Mechanize->new(autocheck => 1);
    $mechanize->get($url);
    # content returns the whole page as one string
    my $string_of_data = $mechanize->content;
    # /g in list context returns every captured number; /i ignores case
    my @digits = $string_of_data =~ m/mydigits=(\d+)/ig;
    print "$_\n" for @digits;

The /m and /s modifiers are not needed here: they only change how ^, $ and . match, and this pattern uses none of them.
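
The list-context match can be tried without fetching a page. The sketch below assumes a made-up two-line string in place of the real content and reuses the @desired_array name from the original question.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the page content.
my $string_of_data = join "\n",
    '<a href="x">bla<i>mydigits=219824%bla MyDigits=2230239084 bla</a>',
    '<a href="y">mydigits=04598<b>bla</b>MYDIGITS=98739874x</a>';

# With /g in list context, the match returns every captured group at once,
# regardless of how many lines the string spans.
my @desired_array = $string_of_data =~ /mydigits=(\d+)/ig;

# The captures are strings, so the leading zero in 04598 is preserved.
print "$_\n" for @desired_array;
```

Because the captured values stay strings, a number such as 04598 keeps its leading zero as long as it is not used in arithmetic.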