
Creating an HTML parser using PHP

P: 14
Hi,

I am new to PHP. I need to write a PHP program that parses HTML files, reads the values from certain form-fields and inserts them as records into the database.

The latter part is easy, but I have no clue about making an HTML parser using PHP. Can anyone help me out here?

Thanks in advance!
Jan 3 '08 #1
4 Replies


karlectomy
P: 64
Are you parsing HTML files that are already created?

You might want to check this out:
http://us3.php.net/manual/en/function.file.php
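For what it's worth, here is a minimal sketch of what file() gives you. The temporary file and its contents are made up purely so the snippet runs on its own; in your case you would pass the path of the converted HTML file instead.

```php
<?php
// Hypothetical sample file, written here only so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, "<html>\n<head><title>Demo</title></head>\n</html>\n");

// file() returns the whole file as an array, one element per line.
// FILE_IGNORE_NEW_LINES strips the trailing newline from each element.
$lines = file($path, FILE_IGNORE_NEW_LINES);

foreach ($lines as $number => $line) {
    // htmlspecialchars() keeps the tags visible if the output is viewed in a browser.
    echo $number . ': ' . htmlspecialchars($line) . "\n";
}

// Clean up the sample file.
unlink($path);
```

Once the lines are in an array you can loop over them and match whatever you need.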
Jan 3 '08 #2

P: 14
Hi karlectomy,

Thanks for replying. Yes, I am parsing HTML files that are already created. But the files are created from PDFs, and what I need to do is read the HTML file, extract all its elements' contents, build a query from those extracted values and insert the records into the database. I am using regular expressions to match the HTML tags.

So far my code is as follows:

[PHP]
<?php
// Defaults reported when a tag is missing from the document head.
$page_title  = "n/a";
$page_author = "n/a";
$meta_descr  = "n/a";
$meta_keywd  = "n/a";

if ($handle = @fopen("temp.html", "r")) {
    // Read in 1 KB chunks and stop once the closing </head> tag has been read.
    $content = "";
    while (!feof($handle)) {
        $content .= fread($handle, 1024);
        // Search everything read so far, so a tag split across two chunks is still found.
        if (stripos($content, "</head>") !== false) break;
    }
    fclose($handle);

    $lines = preg_split("/\r?\n|\r/", $content); // split the content into lines
    $is_title  = false;
    $is_author = false;
    $is_descr  = false;
    $is_keywd  = false;

    foreach ($lines as $val) {
        // preg_match() with the i flag stands in for eregi(), which was removed in PHP 7.
        if (preg_match('#<title>(.*?)</title>#i', $val, $title)) {
            $page_title = $title[1];
            echo 'page_title: ' . $page_title . "\n";
            $is_title = true;
        }
        if (preg_match('#<meta name="author" content="(.*?)"(\s?/)?>#i', $val, $author)) {
            $page_author = $author[1];
            echo 'page_author: ' . $page_author . "\n";
            $is_author = true;
        }
        if (preg_match('#<meta name="description" content="(.*?)"(\s?/)?>#i', $val, $descr)) {
            $meta_descr = $descr[1];
            echo 'meta_descr: ' . $meta_descr . "\n";
            $is_descr = true;
        }
        if (preg_match('#<meta name="keywords" content="(.*?)"(\s?/)?>#i', $val, $keywd)) {
            $meta_keywd = $keywd[1];
            echo 'meta_keywd: ' . $meta_keywd . "\n";
            $is_keywd = true;
        }
        // Stop scanning as soon as all four values have been found.
        if ($is_title && $is_author && $is_descr && $is_keywd) break;
    }
}
?>

[/PHP]

But this only parses the <HEAD></HEAD> section. Parsing the <BODY></BODY> section is the real challenge, and that is where I need help.

Thanks,
sasha
Jan 4 '08 #3

karlectomy
P: 64
I think you're on the right track. I am also working on parsing large volumes of text at the moment. You have the right idea, so go with it. The conditional logic can get cluttered, so try to keep it simple; otherwise it will be difficult to debug later.

It looks like you know what you're doing.
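If the regexes for the body get unwieldy, PHP's bundled DOM extension is one alternative that might be worth a look, since it hands you the form fields as objects. Below is a minimal sketch, not a finished solution. The sample markup and the field names (first_name, email, notes) are invented for illustration, and I am assuming the values you want sit in input and textarea elements.

```php
<?php
// Hypothetical sample markup; in practice this could come from file_get_contents("temp.html").
$html = <<<HTML
<html><body>
<form>
<input type="text" name="first_name" value="Sasha">
<input type="text" name="email" value="sasha@example.com">
<textarea name="notes">Converted from PDF</textarea>
</form>
</body></html>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// converted files often contain sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array();

// <input> elements keep their value in the "value" attribute.
foreach ($doc->getElementsByTagName('input') as $input) {
    $name = $input->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $input->getAttribute('value');
    }
}

// <textarea> elements keep their value as text content.
foreach ($doc->getElementsByTagName('textarea') as $area) {
    $name = $area->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $area->textContent;
    }
}

print_r($fields);
```

The $fields array then maps each field name to its value, which should be easy to turn into the insert query.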
Jan 7 '08 #4

P: 14
Thanks :)
I am not that well versed in regex, so it's taking more time than I expected. I am glad a lot of online help is available.
Jan 14 '08 #5
