The function get_meta_tags reads a file or URL and returns an array
with one element for each META tag in the HEAD element of the document.
The keys of this array are the values of the name attribute of each
tag; each value is the text contained in the content attribute of
that tag. For example, given a file or URL whose HEAD element
contains
<meta name="keywords" content="PHP language; HTML meta tag; question">
the function would return
$result = array (
"keywords" => "PHP language; HTML meta tag; question"
);
All well and good. However, there is no reason to expect the value
of the name attribute to be unique within an HTML page. I could
easily see
<meta name="publisher" content="University of Kansas">
<meta name="keywords" content="fusulinids, ostracodes, brachiopods">
<meta name="keywords" content="Lawrence, Douglas County, Kansas, USA">
Faced with this HTML, get_meta_tags returns the publisher entry but
only the second of the two keywords entries. The first is silently
overwritten, because the function calls
add_assoc_string (return_value, name, value, 0)
without checking whether the same name has been used already.
(This is in ext/standard/file.c.)
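A minimal sketch that reproduces the overwrite from PHP itself. It
assumes a PHP command-line build that can write to the system temp
directory; the temporary file and the variable names are only for
illustration.

```php
<?php
// The example HEAD from above, with two META tags sharing name="keywords".
$html = <<<'HTML'
<html><head>
<meta name="publisher" content="University of Kansas">
<meta name="keywords" content="fusulinids, ostracodes, brachiopods">
<meta name="keywords" content="Lawrence, Douglas County, Kansas, USA">
</head><body></body></html>
HTML;

// get_meta_tags() reads from a file or URL, so write the HTML to a temp file.
$path = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($path, $html);

$tags = get_meta_tags($path);
unlink($path);

// Only one "keywords" entry survives: the later tag replaces the earlier one.
var_dump($tags);
```

The dump should show two keys, publisher and keywords, with keywords
holding only the "Lawrence, Douglas County, Kansas, USA" string.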
Questions
1. Should I suggest that this function be changed so that, upon
encountering a name for the second time, it changes its return
value to hold a nested array under that name, as follows?
$result = array (
"publisher" => "University of Kansas",
"keywords" => array (
0 => "fusulinids, ostracodes, brachiopods",
1 => "Lawrence, Douglas County, Kansas, USA"
)
);
Or is there some reason why this would be unacceptable or
not feasible?
2. Anybody got a better way to extract meta tags from HTML pages?
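One possible userland approach to both questions, offered as a sketch
rather than a definitive answer: parse the page with PHP's DOM
extension and build the nested array from question 1 by hand. This
assumes a PHP build with the DOM extension enabled. The function name
collect_meta_tags is hypothetical, not part of PHP. Unlike
get_meta_tags, it looks at META tags anywhere in the document, not
only in HEAD, and it leaves the names exactly as written.

```php
<?php
// Hypothetical helper (not a PHP built-in): collect META name/content
// pairs, turning a repeated name into a list of its content values.
function collect_meta_tags(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (!$meta->hasAttribute('name')) {
            continue; // skip tags such as http-equiv that carry no name
        }
        $name = $meta->getAttribute('name');
        $content = $meta->getAttribute('content');

        if (!array_key_exists($name, $result)) {
            $result[$name] = $content;                    // first occurrence: plain string
        } elseif (is_array($result[$name])) {
            $result[$name][] = $content;                  // third or later: append
        } else {
            $result[$name] = [$result[$name], $content];  // second: promote to a list
        }
    }
    return $result;
}

$html = <<<'HTML'
<html><head>
<meta name="publisher" content="University of Kansas">
<meta name="keywords" content="fusulinids, ostracodes, brachiopods">
<meta name="keywords" content="Lawrence, Douglas County, Kansas, USA">
</head><body></body></html>
HTML;

$tags = collect_meta_tags($html);
print_r($tags);
```

A caller then has to check is_array() on each value before using it,
which is the same cost the proposed change to get_meta_tags would
impose on existing scripts.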
Thanks for any and all help.
Peter
--
Peter N. Schweitzer (MS 954, U.S. Geological Survey, Reston, VA 20192)
(703) 648-6533 FAX: (703) 648-6252 email:
pschweitzer@usgs.gov
<http://geology.usgs.gov/peter/>