Are you parsing HTML files that are already created?
You might want to check this out:
http://us3.php.net/manual/en/function.file.php
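A minimal sketch of what that function does, in case it helps: file() reads a whole file into an array with one element per line, so no fopen()/fread() loop is needed. The file name temp.html and the helper name read_html_lines() are placeholders made up for this example, not anything from the manual.

[PHP]
<?php
// Hypothetical helper: returns the lines of a file, or an empty array
// if the file cannot be read.
function read_html_lines($path)
{
    if (!is_readable($path)) {
        return array();
    }
    // FILE_IGNORE_NEW_LINES strips the trailing newline from each element.
    $lines = file($path, FILE_IGNORE_NEW_LINES);
    return ($lines === false) ? array() : $lines;
}

// "temp.html" is a placeholder file name.
foreach (read_html_lines("temp.html") as $number => $line) {
    echo ($number + 1) . ": " . htmlspecialchars($line) . "\n";
}
[/PHP]

Each array element can then be matched or parsed on its own.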
Hi karlectomy,
Thanks for replying. Yes, I am parsing HTML files that already exist. The files are generated from PDFs, and what I need to do is read each HTML file, extract the contents of all its elements, build a query from the extracted values, and insert the records into the database. I am using regular expressions to match the HTML tags.
So far my code is as follows:
[PHP]
<?php
$page_title  = "n/a";
$page_author = "n/a";
$meta_descr  = "n/a";
$meta_keywd  = "n/a";

if ($handle = @fopen("temp.html", "r")) {
    // Read in 1 KB chunks and stop once the end of <head> has been seen.
    $content = "";
    while (!feof($handle)) {
        $content .= fread($handle, 1024);
        // Test the accumulated content, not only the last chunk, so that a
        // "</head>" split across two chunks is still found.
        if (stripos($content, "</head>") !== false) break;
    }
    fclose($handle);

    $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows
    $is_title  = false;
    $is_author = false;
    $is_descr  = false;
    $is_keywd  = false;
    // new in ver. 1.01; $xhtml is not set anywhere in this snippet
    $close_tag = !empty($xhtml) ? " />" : ">";

    foreach ($lines as $val) {
        // preg_match() with the i flag replaces eregi(), which was removed in PHP 7.
        if (preg_match('#<title>(.*)</title>#i', $val, $title)) {
            $page_title = $title[1];
            echo 'page_title: ' . $page_title . "\n";
            $is_title = true;
        }
        if (preg_match('#<meta name="author" content="(.*)"(\s?/)?>#i', $val, $author)) {
            $page_author = $author[1];
            echo 'page_author: ' . $page_author . "\n";
            $is_author = true;
        }
        if (preg_match('#<meta name="description" content="(.*)"(\s?/)?>#i', $val, $descr)) {
            $meta_descr = $descr[1];
            echo 'meta_descr: ' . $meta_descr . "\n";
            $is_descr = true;
        }
        if (preg_match('#<meta name="keywords" content="(.*)"(\s?/)?>#i', $val, $keywd)) {
            $meta_keywd = $keywd[1];
            echo 'meta_keywd: ' . $meta_keywd . "\n";
            $is_keywd = true;
        }
        // Stop scanning as soon as all four values have been found.
        if ($is_title && $is_author && $is_descr && $is_keywd) break;
    }
}
?>
[/PHP]
But this only parses the <HEAD></HEAD> section. Parsing the <BODY></BODY> section is the real challenge, and that is where I need help.
Thanks,
sasha
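For the body, a DOM parser may be easier to work with than line-by-line regular expressions. The following is only a sketch under assumptions: it relies on PHP's bundled DOM extension, it guesses that the PDF converter puts the text into <p> elements (adjust the tag name to whatever the converter actually emits), and extract_body_text() and temp.html are made-up names for illustration.

[PHP]
<?php
// Hypothetical helper: returns the trimmed text of every element with the
// given tag name found inside <body>, in document order.
function extract_body_text($html, $tag = "p")
{
    $values = array();
    if (trim($html) === "") {
        return $values; // nothing to parse
    }

    $doc = new DOMDocument();
    // Converted PDFs rarely produce valid markup, so collect the parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName("body")->item(0);
    if ($body === null) {
        return $values;
    }
    foreach ($body->getElementsByTagName($tag) as $node) {
        $text = trim($node->textContent);
        if ($text !== "") {
            $values[] = $text; // skip empty elements
        }
    }
    return $values;
}

// "temp.html" is a placeholder file name.
$html = @file_get_contents("temp.html");
if ($html !== false) {
    foreach (extract_body_text($html, "p") as $value) {
        echo $value . "\n";
    }
}
[/PHP]

The returned values can then go into the INSERT query. Binding them as parameters of a prepared statement (PDO or mysqli) avoids quoting problems with text that came out of a PDF.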