Hi,
What I'm trying to do seems right up Perl's alley, but I can't get it to work. I'm using the WWW::Mechanize module to retrieve a sprawling HTML document from which I want to extract certain strings and save them. I can get this much to work:
use WWW::Mechanize;

$url = "http://someurl";

my $mechanize = WWW::Mechanize->new(autocheck => 1);

$mechanize->get($url);

my @array_of_data = $mechanize->content;

but now I am stuck on how to process that data.
The HTML doc is quite long, and contains numbers that I want to extract, numbers that are always preceded by a text string that is the same each time, such as:
<a href bla bla bla>bla bla bla<random tag>
mydigits=493409834%bla bla bla<meaningless tag>bla bla</a>
where the string "mydigits=" always precedes the desired number and is sometimes all lowercase but can occasionally look like "MyDigits="; where the number itself may be anywhere from one to 10 digits in length; and where "%" might literally be "%" or any other non-digit character including a space. Moreover, the desired string might appear more than once per line -- assuming Perl doesn't see the HTML doc as just one single long line of text anyway.
What I have tried is many extremely ugly variations on
my $pattern = "[Mm]y[Dd]igits=[0-9]*[^0-9]";

foreach (@array_of_data) {
    if ( /$pattern/ ) {
        print "$_\n";
    }
}
but if I don't get an error, all I get is a spew-out of the entire HTML doc instead of what I am hoping for, which would be a printout or file that looks like:
219824
2230239084
04598
98739874
etc., etc.
or better yet, assign the output to an array that looks like:
@desired_array = ( '219824', '2230239084', '04598', '98739874' );
I know I must be missing something very fundamental, so if anyone can help steer me away from the major mistakes I'm making, I'd appreciate it. Thanks.
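
For anyone reading along, here is a minimal sketch of one way the extraction described above might be done; it is not necessarily the best approach. It relies on the fact that content returns the whole page as one string, so @array_of_data ends up holding a single element, and printing $_ prints the entire document. The $html sample below is made up for illustration and stands in for the real page, and the match is applied to that one string rather than line by line.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for $mechanize->content,
# which returns the page as a single string.
my $html = <<'END_HTML';
<a href bla bla bla>bla bla bla<random tag>
mydigits=493409834%bla bla bla<meaningless tag>bla bla</a>
<a href bla>MyDigits=04598 bla mydigits=219824&bla</a>
END_HTML

# /i ignores case ("mydigits=" or "MyDigits="), /g finds every match.
# In list context, a /g match with one capture group returns all of
# the captured digit runs, so nothing else on the line is kept.
my @desired_array = $html =~ /mydigits=([0-9]{1,10})/gi;

print "$_\n" for @desired_array;
```

The captured values are strings, so a leading zero such as the one in 04598 is kept as long as the value is not used as a number.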