Help with a regular expression

YoBro

Hi

I have used some of this code from the PHP manual, but I am bloody hopeless
with regular expressions.
Was hoping somebody could offer a hand.

The output of this will put the name of a form field beside name.
I want to get the following but not sure how to modify the code below.
1. Field Name (to appear beside NAME:)
2. Field Type (to appear beside TYPE:)
3. Field Value (to appear beside VALUE:)

Make sense?
It is part way there, just need some help finishing it.

$filename = "form-eg.php"; // Open file to read HTML with Form code
$fd = fopen ($filename, "rb");
$contents = fread ($fd, filesize ($filename));
preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents,
$matches); // get all input fields and attributes and values

for ($i=0; $i< count($matches[0]); $i++) {
    echo "matched: ".$matches[0][$i]."<br />\n";
    echo "NAME: ".$matches[1][$i]."<br />\n";
    echo "TYPE: ".$matches[3][$i]."<br />\n";
    echo "VALUE: ".$matches[4][$i]."<br />\n\n";
}

fclose ($fd);

I will also need to run another check for :
<select
<textarea

But I can probably figure that out from what I already have.

Thanks,

YoBro

Jul 17 '05 #1

Pedro Graca

YoBro wrote:

> I have used some of this code from the PHP manual, but I am bloody hopeless
> with regular expressions.

Although I've heard often enough that RXs are not the best tool for this
job (try a HTML or XML parser) I do very well with them myself :)

> Was hoping somebody could offer a hand.
>
> The output of this will put the name of a form field beside name.
> I want to get the following but not sure how to modify the code below.
> 1. Field Name (to appear beside NAME:)
> 2. Field Type (to appear beside TYPE:)
> 3. Field Value (to appear beside VALUE:)

But I follow a different path than you.

<?php
// initialize result data
$html_inputs = array();
$html_index = 0;

// get HTML
$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

// get all "<input ... >"s -- usually I'd group them by <form>s too
preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// inside each "<input ... >" isolate the pairs "attr=value"
foreach ($inputs[1] as $input) {
    // once for double quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
    // save them
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // once for single quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // and once again for unquoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // all three passes describe the same <input>, so move on only once
    ++$html_index;
}

// done, deal with them anyway I like
echo '<pre>'; print_r($html_inputs); echo '</pre>';
?>
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--

Jul 17 '05 #2

John Dunlop

Pedro Graca wrote:

> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)

I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
SGML; I certainly don't; I'm far too young for starters.

I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)

> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!)

<INPUT title=">">

Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
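
To make the two points above concrete, here is a small sketch that runs the quoted pattern against the example tag and against a bare <INPUT>. The helper name naive_input_tags is made up for this illustration; only preg_match_all and the pattern itself come from the thread.

```php
<?php
// Hypothetical helper (not from the thread): apply the quoted pattern
// and return whatever it considers to be complete <input> tags.
function naive_input_tags($html)
{
    preg_match_all('@(<input[^>]+>)@Ui', $html, $m);
    return $m[1];
}

// A ">" inside a quoted attribute value cuts the match short.
$truncated = naive_input_tags('<INPUT title=">">');
print_r($truncated);   // the only match is: <INPUT title=">

// "+" demands at least one character after "<input", so a bare tag is skipped.
$bare = naive_input_tags('<INPUT>');
print_r($bare);        // empty array
```

Swapping "+" for "*" would pick up the bare tag, but the truncation at a quoted ">" remains, which is the harder problem.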

> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);

An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern

[a-zA-Z][a-zA-Z0-9._:-]*

An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",

<news:1q*******************************@40tude.net>.

> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);

Ditto.

> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);

Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.
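
A sketch of how the name pattern might be used from PHP, assuming a helper called is_sgml_name (a hypothetical name) and whole-string anchoring. One detail worth flagging: inside a PCRE character class a hyphen between two characters denotes a range, so the sketch places the hyphen last, where it is taken literally.

```php
<?php
// Hypothetical helper: does $s consist of a name start character
// followed by zero or more name characters (letters, digits, . - _ :)?
function is_sgml_name($s)
{
    // Hyphen placed last so it is literal; the "D" modifier stops "$"
    // from matching before a trailing newline.
    return preg_match('/^[a-zA-Z][a-zA-Z0-9._:-]*$/D', $s) === 1;
}

var_dump(is_sgml_name('http-equiv'));  // bool(true)
var_dump(is_sgml_name('xml:lang'));    // bool(true)
var_dump(is_sgml_name('9lives'));      // bool(false): starts with a digit
var_dump(is_sgml_name('a=b'));         // bool(false): "=" is not a name character

// Why the hyphen position matters: ".-_" written mid-class is the range
// 0x2E to 0x5F, which includes "=", "<" and ">".
var_dump(preg_match('/^[.-_]$/', '=') === 1);  // bool(true)
```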

Phew!

Refs.:

http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

--
Jock

Jul 17 '05 #3

Pedro Graca

John Dunlop wrote:

> Pedro Graca wrote:
>> Although I've heard often enough that RXs are not the best tool for this
>> job (try a HTML or XML parser) I do very well with them myself :)
>
> I'd like to pass a few comments, nevertheless, which might change
> your mind about regular expressions for parsing (X)HTML.

Appreciate it.

> They changed my mind, anyway.

Changed my mind, too. Will take a little longer to change my scripts.
But new scripts will not use regular expressions!

> You'll understand though, hopefully, why I
> haven't offered any regular expression in place of yours (no, it's
> not because I couldn't be bothered :-)).

Same reason I'm not changing them, I guess :-)

> (Trying to cope with shorthand markup when using regexes would be a
> nightmare. Unlike proper parsers, I'm going to act like a browser
> and ignore shorthand markup, for the time being, as it'd complicate
> matters even more.)

Don't even mention that.

(snip very good content)

Thank you John. Thank you very much.

Jul 17 '05 #4

YoBro

Any idea of some real-life working examples of doing it the SGML way? It's
something I have never heard of before.

The reference links appear to have no relevance to what I am trying to do.

There is a PHP function, xml_parse; could this be used?
The documentation is light on that topic.

Thanks!

Jul 17 '05 #5

Pedro Graca

I (Pedro Graca) wrote:

> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!

Ufffffff. This took longer than I expected.

The XML parser included with PHP gives errors for many of the pages I
tested (most of them were HTML pages, so it's understandable :).

I found a parser for HTML I like @ http://php-html.sourceforge.net/

#v+
<?php
include 'htmlparser.inc.php'; // Yes! I changed the name
// also changed short php tag

$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

$parser = new HtmlParser($contents);
while ($parser->parse()) {
    if (strtolower($parser->iNodeName) == 'input') {

        #echo "\niNodeType: "; print_r($parser->iNodeType);
        #echo "\niNodeName: "; print_r($parser->iNodeName);
        #echo "\niNodeValue: "; print_r($parser->iNodeValue);
        echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
    }
}

echo "\n\nDone!\n";
?>
#v-

and the result of this script is:

iNodeAttributes: Array
(
[name] => query
[size] => 25
)

iNodeAttributes: Array
(
[type] => submit
[value] => Search RFCs
)

iNodeAttributes: Array
(
[name] => display
[size] => 9
)

iNodeAttributes: Array
(
[type] => submit
[value] => Display RFC By Number
)
Done!
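
For comparison, roughly the same extraction can be sketched with PHP's bundled DOM extension instead of a third-party parser. This is an assumption-laden sketch, not something used in this thread: it presumes PHP 5.4 or later with the DOM extension enabled, list_input_fields is a made-up name, and the sample markup is invented for the demonstration.

```php
<?php
// Hypothetical helper (not from the thread): return name/type/value
// for every <input> element found in an HTML string.
function list_input_fields($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well-formed; collect libxml's complaints
    // internally instead of printing them as PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = array();
    foreach ($doc->getElementsByTagName('input') as $input) {
        $fields[] = array(
            'name'  => $input->getAttribute('name'),
            // HTML treats a missing type attribute as "text"
            'type'  => $input->hasAttribute('type')
                ? strtolower($input->getAttribute('type'))
                : 'text',
            'value' => $input->getAttribute('value'),
        );
    }
    return $fields;
}

// Invented sample markup: quoted, unquoted and ">"-containing attribute values.
$html = '<form><input name="query" size="25">'
      . '<INPUT type=submit value="Search RFCs">'
      . '<input title=">" name=q2></form>';

$fields = list_input_fields($html);
foreach ($fields as $f) {
    echo "NAME: {$f['name']}\nTYPE: {$f['type']}\nVALUE: {$f['value']}\n\n";
}
```

The same loop could be pointed at 'select' and 'textarea' via getElementsByTagName, though their values live in child nodes rather than in a value attribute.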

Jul 17 '05 #6

YoBro

Hi,

Thanks, that is very helpful.
I have tried to download this file but my browser keeps crashing when I get
there.

I don't suppose if you have a copy you could email it to me?
('htmlparser.inc.php')
to: yo***@wazzup.co.nz.

YoBro!

Jul 17 '05 #7

Pedro Graca

YoBro top-posted:

> I have tried to download this file but my browser keeps crashing when I get
> there.
>
> I don't suppose if you have a copy you could email it to me?

Try here first :)
https://sourceforge.net/project/show...group_id=91649

Jul 17 '05 #8
