Creating an HTML parser using PHP

hello2008

14 New Member

Hi,

I am new to PHP. I need to write a PHP program that parses HTML files, reads the values from certain form-fields and inserts them as records into the database.

The latter part is easy, but I have no clue about making an HTML parser using PHP. Can anyone help me out here?

Thanks in advance!

Jan 3 '08 #1

karlectomy

New Member

Are you parsing HTML files that are already created?

You might want to check this out:
http://us3.php.net/manual/en/function.file.php
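
In case it helps, here is a minimal sketch of what file() gives you. The sample file and its contents are made up so the snippet runs on its own; in practice you would point $path at your real HTML file.

```php
<?php
// Hypothetical sample file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, "<html>\n<head><title>Demo</title></head>\n<body></body>\n</html>\n");

// file() returns the whole file as an array with one element per line.
// FILE_IGNORE_NEW_LINES strips the trailing newline from each element.
$lines = file($path, FILE_IGNORE_NEW_LINES);
if ($lines === false) {
    die("Could not read $path\n");
}

// Print each line with its 1-based line number.
foreach ($lines as $number => $line) {
    echo ($number + 1) . ': ' . $line . "\n";
}

unlink($path); // remove the sample file again
```

From there you can loop over $lines and test each one, or use file_get_contents() instead if you would rather have the whole file as one string.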

Jan 3 '08 #2

hello2008

New Member

Hi karlectomy,

Thanks for replying. Yes, I am parsing HTML files that are already created. The files are generated from PDFs, and what I need to do is read each HTML file, extract the contents of all its elements, build a query from the extracted values and insert the records into the database. I am using regular expressions to match the HTML tags.

So far my code is as follows:

[PHP]
<?php
// Defaults reported when a tag is not found.
$page_title  = "n/a";
$page_author = "n/a";
$meta_descr  = "n/a";
$meta_keywd  = "n/a";

if ($handle = @fopen("temp.html", "r")) {
    // Read in 1 KB chunks and stop once the end of <head> has been read.
    $content = "";
    while (!feof($handle)) {
        $content .= fread($handle, 1024);
        // Test the accumulated content so a tag split across two chunks is still found.
        if (stripos($content, "</head>") !== false) break;
    }
    fclose($handle);

    $lines = preg_split("/\r?\n|\r/", $content); // split the content into lines
    $is_title = $is_author = $is_descr = $is_keywd = false;

    foreach ($lines as $val) {
        // eregi() was removed in PHP 7; preg_match() with the /i flag replaces it.
        if (preg_match('/<title>(.*)<\/title>/i', $val, $title)) {
            $page_title = $title[1];
            echo 'page_title: ' . $page_title . "\n";
            $is_title = true;
        }
        // ([^"]*) stops at the closing quote of the content attribute.
        if (preg_match('/<meta name="author" content="([^"]*)"\s?\/?>/i', $val, $author)) {
            $page_author = $author[1];
            echo 'page_author: ' . $page_author . "\n";
            $is_author = true;
        }
        if (preg_match('/<meta name="description" content="([^"]*)"\s?\/?>/i', $val, $descr)) {
            $meta_descr = $descr[1];
            echo 'meta_descr: ' . $meta_descr . "\n";
            $is_descr = true;
        }
        if (preg_match('/<meta name="keywords" content="([^"]*)"\s?\/?>/i', $val, $keywd)) {
            $meta_keywd = $keywd[1];
            echo 'meta_keywd: ' . $meta_keywd . "\n";
            $is_keywd = true;
        }
        // Stop as soon as all four values have been found.
        if ($is_title && $is_author && $is_descr && $is_keywd) break;
    }
}
?>

[/PHP]

But this only parses the <HEAD></HEAD> section. Parsing the <BODY></BODY> is a real challenge, and I need help with that.
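
One possible way to get at the form fields in the body without hand-written regular expressions is PHP's bundled DOM extension. This is only a rough sketch under a few assumptions: the dom extension is enabled, the values sit in <input> and <textarea> elements, and the markup and field names (customer, city, notes) are made up for illustration.

```php
<?php
// Hypothetical sample markup; in practice load the real file with file_get_contents().
$html = <<<HTML
<html><head><title>Order</title></head>
<body>
<form>
  <input type="text" name="customer" value="Sasha">
  <input type="text" name="city" value="Pune">
  <textarea name="notes">Deliver after 5pm</textarea>
</form>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);  // collect warnings about sloppy markup instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array();

// <input> elements keep their value in the "value" attribute.
foreach ($doc->getElementsByTagName('input') as $input) {
    $name = $input->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $input->getAttribute('value');
    }
}

// <textarea> elements keep their value as text content.
foreach ($doc->getElementsByTagName('textarea') as $area) {
    $name = $area->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $area->textContent;
    }
}

print_r($fields);
```

The resulting $fields array maps each field name to its value, which could then be bound to the parameters of the INSERT query.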

Thanks,
sasha

Jan 4 '08 #3

karlectomy

New Member

I think you're on the right track. I also am currently working on parsing large volumes of text. You have the right idea. Go with it. The conditional logic can get cluttered so try to keep it simple, otherwise it will be difficult to debug in the future.

It looks like you know what you're doing.

Jan 7 '08 #4

hello2008

New Member

Thanks :)
I am not that well versed in regex, so it's taking more time than I expected. I am glad a lot of online help is available.
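
For anyone else struggling with the same patterns, one regex pitfall worth knowing is greedy matching. The sketch below uses a made-up <meta> line to show how (.*) can swallow more than one attribute, while a negated character class stops at the first closing quote.

```php
<?php
// Hypothetical line with another quoted attribute after content="...".
$line = '<meta name="description" content="PDF export" lang="en">';

// Greedy: (.*) runs to the LAST double quote that still lets the pattern match.
preg_match('/content="(.*)"/i', $line, $greedy);

// Negated class: ([^"]*) cannot cross a double quote, so it stops at the first closing quote.
preg_match('/content="([^"]*)"/i', $line, $exact);

echo $greedy[1] . "\n"; // PDF export" lang="en
echo $exact[1] . "\n";  // PDF export
```

The /i flag makes the match case-insensitive, which is what eregi() used to do.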

Jan 14 '08 #5
