Creating an HTML parser using PHP

hello2008
14 New Member
Hi,

I am new to PHP. I need to write a PHP program that parses HTML files, reads the values from certain form-fields and inserts them as records into the database.

The latter part is easy, but I have no clue about making an HTML parser using PHP. Can anyone help me out here?

Thanks in advance!
Jan 3 '08 #1
karlectomy
64 New Member
Are you parsing HTML files that are already created?

You might want to check this out:
http://us3.php.net/manual/en/function.file.php
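
For reference, here is a minimal sketch of what that function does. The sample file and its contents are assumptions, created only so the snippet runs on its own. file() returns the file as an array of lines, while file_get_contents() returns it as one string, which is often handier when a tag may span several lines.

```php
<?php
// Hypothetical sample file, written here only so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, "<html>\n<head><title>Demo</title></head>\n<body></body>\n</html>\n");

// file() returns the file as an array, one element per line.
// FILE_IGNORE_NEW_LINES strips the trailing newline from each element.
$lines = file($path, FILE_IGNORE_NEW_LINES);

// file_get_contents() returns the same file as a single string.
$whole = file_get_contents($path);

echo count($lines), " lines, ", strlen($whole), " bytes\n";

// Remove the temporary file again.
unlink($path);
```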
Jan 3 '08 #2
hello2008
14 New Member
Hi karlectomy,

Thanks for replying. Yes, I am parsing HTML files that are already created. The files are generated from PDFs, and what I need to do is read each HTML file, extract the contents of all its elements, build a query from those extracted values and insert the records into the database. I am using regular expressions to match the HTML tags.

So far my code is as follows:

[PHP]
<?php
// Defaults used when a tag is missing from the file.
$page_title  = "n/a";
$page_author = "n/a";
$meta_descr  = "n/a";
$meta_keywd  = "n/a";
$xhtml       = false; // was undefined in the original; set to true for XHTML input

if ($handle = @fopen("temp.html", "r")) {
    // Read in 1 KB chunks and stop once the closing </head> tag has been seen.
    $content = "";
    while (!feof($handle)) {
        $content .= fread($handle, 1024);
        // Search the accumulated text so a tag split across two chunks is still found.
        if (stripos($content, "</head>") !== false) break;
    }
    fclose($handle);
    $lines = preg_split("/\r?\n|\r/", $content); // turn the content into rows
    $is_title = false;
    $is_author = false;
    $is_descr = false;
    $is_keywd = false;
    $close_tag = ($xhtml) ? " />" : ">"; // new in ver. 1.01
    foreach ($lines as $val) {
        // preg_match() with the /i flag replaces eregi(), which was removed in PHP 7.
        if (preg_match('/<title>(.*)<\/title>/i', $val, $title)) {
            $page_title = $title[1];
            echo 'page_title: ' . $page_title . PHP_EOL;
            $is_title = true;
        }
        if (preg_match('/<meta name="author" content="(.*)"(\s?\/)?>/i', $val, $author)) {
            $page_author = $author[1];
            echo 'page_author: ' . $page_author . PHP_EOL;
            $is_author = true;
        }
        if (preg_match('/<meta name="description" content="(.*)"(\s?\/)?>/i', $val, $descr)) {
            $meta_descr = $descr[1];
            echo 'meta_descr: ' . $meta_descr . PHP_EOL;
            $is_descr = true;
        }
        if (preg_match('/<meta name="keywords" content="(.*)"(\s?\/)?>/i', $val, $keywd)) {
            $meta_keywd = $keywd[1];
            echo 'meta_keywd: ' . $meta_keywd . PHP_EOL;
            $is_keywd = true;
        }
        // Stop scanning once every field has been found.
        if ($is_title && $is_author && $is_descr && $is_keywd) break;
    }
}
?>

[/PHP]

But this only parses the <HEAD></HEAD> section. Parsing the <BODY></BODY> is a real challenge, and that is where I need help.
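
One way to get at the body without hand-written regular expressions is PHP's bundled DOM extension. The following is only a sketch under assumptions: the markup, the field names and the $fields array are all made up for illustration, and real files converted from PDFs may need extra handling. It loads the HTML with DOMDocument::loadHTML() and collects the value of each input, textarea and select element by name.

```php
<?php
// Hypothetical markup standing in for one of the converted files.
$html = <<<HTML
<html><head><title>Order</title></head>
<body>
<form>
  <input type="text" name="customer" value="Alice">
  <input type="text" name="city" value="Pune">
  <textarea name="notes">Deliver after 5pm</textarea>
  <select name="size">
    <option value="S">Small</option>
    <option value="L" selected="selected">Large</option>
  </select>
</form>
</body></html>
HTML;

$doc = new DOMDocument();
// Converted HTML is rarely valid; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fields = []; // field name => value, ready to be bound to an INSERT query

foreach ($doc->getElementsByTagName('input') as $input) {
    $fields[$input->getAttribute('name')] = $input->getAttribute('value');
}
foreach ($doc->getElementsByTagName('textarea') as $area) {
    $fields[$area->getAttribute('name')] = trim($area->textContent);
}
foreach ($doc->getElementsByTagName('select') as $select) {
    // Keep the value of the option marked as selected.
    foreach ($select->getElementsByTagName('option') as $option) {
        if ($option->hasAttribute('selected')) {
            $fields[$select->getAttribute('name')] = $option->getAttribute('value');
        }
    }
}

print_r($fields);
```

The resulting array maps each field name to its value, so it can be passed to a prepared statement for the database insert.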

Thanks,
sasha
Jan 4 '08 #3
karlectomy
64 New Member
I think you're on the right track. I am also currently working on parsing large volumes of text. You have the right idea. Go with it. The conditional logic can get cluttered, so try to keep it simple; otherwise it will be difficult to debug in the future.

It looks like you know what you're doing.
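
One possible way to keep that conditional logic from getting cluttered is to move the patterns into a table and loop over it. This is a rough sketch, not a drop-in replacement: the field names, the sample markup and the $patterns and $found variables are assumptions made for illustration.

```php
<?php
// Hypothetical field-to-pattern table; field names and sample markup are assumptions.
$patterns = [
    'title'       => '/<title>(.*?)<\/title>/is',
    'author'      => '/<meta\s+name="author"\s+content="(.*?)"\s*\/?>/is',
    'description' => '/<meta\s+name="description"\s+content="(.*?)"\s*\/?>/is',
    'keywords'    => '/<meta\s+name="keywords"\s+content="(.*?)"\s*\/?>/is',
];

$content = '<head><title>Demo page</title>'
         . '<meta name="author" content="A. Writer" />'
         . '<meta name="keywords" content="php, parser"></head>';

$found = [];
foreach ($patterns as $field => $pattern) {
    // One loop replaces a separate if-block per field; missing fields default to "n/a".
    $found[$field] = preg_match($pattern, $content, $m) ? $m[1] : 'n/a';
}

print_r($found);
```

Adding a new field then means adding one line to the table rather than another if-block.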
Jan 7 '08 #4
hello2008
14 New Member
Thanks :)
I am not that well versed in regex, so it's taking more time than I expected. I am glad a lot of online help is available.
Jan 14 '08 #5
