Bytes IT Community
Help with a regular expression

Hi

I have used some of this code from the PHP manual, but I am bloody hopeless
with regular expressions.
Was hoping somebody could offer a hand.

The output of this will put the name of a form field beside name.
I want to get the following but not sure how to modify the code below.
1. Field Name (to appear beside NAME:)
2. Field Type (to appear beside TYPE:)
3. Field Value (to appear beside VALUE:)

Does that make sense?
It is part way there, just need some help finishing it.

$filename = "form-eg.php"; // Open file to read HTML with Form code
$fd = fopen ($filename, "rb");
$contents = fread ($fd, filesize ($filename));
preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents,
$matches); // get all input fields and attributes and values

for ($i=0; $i< count($matches[0]); $i++) {
echo "matched: ".$matches[0][$i]."<br />\n";
echo "NAME: ".$matches[1][$i]."<br />\n";
echo "TYPE: ".$matches[3][$i]."<br />\n";
echo "VALUE: ".$matches[4][$i]."<br />\n\n";
}

fclose ($fd);

I will also need to run another check for :
<select
<textarea

But I can probably figure that out from what I already have.

Thanks,

YoBro
Jul 17 '05 #1
7 Replies


YoBro wrote:
> I have used some of this code from the PHP manual, but I am bloody hopeless
> with regular expressions.

Although I've heard often enough that RXs are not the best tool for this
job (try a HTML or XML parser) I do very well with them myself :)

> Was hoping somebody could offer a hand.
>
> The output of this will put the name of a form field beside name.
> I want to get the following but not sure how to modify the code below.
> 1. Field Name (to appear beside NAME:)
> 2. Field Type (to appear beside TYPE:)
> 3. Field Value (to appear beside VALUE:)

But I follow a different path than you.

<?php
// initialize result data: one entry per <input>, holding attr => value pairs
$html_inputs = array();
$html_index = 0;

// get HTML
$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

// get all "<input ... >"s -- usually I'd group them by <form>s too
preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// inside each "<input ... >" isolate the pairs "attr=value"
foreach ($inputs[1] as $input) {
    // once for double quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
    // save them
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // once for single quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // and once again for unquoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // advance only after all three passes, so every attribute of this
    // <input> ends up in the same entry
    ++$html_index;
}

// done, deal with them anyway I like
echo '<pre>'; print_r($html_inputs); echo '</pre>';
?>
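
The three passes over each tag can also be folded into one. What follows is only a sketch under stated assumptions, not the code Pedro posted: `parse_attributes` is a hypothetical helper name, the attribute-name pattern is my own choice, and it assumes PHP 7 or later (for the type declarations). It relies on PCRE's branch-reset group `(?|...)`, so that whichever quoting style matched, the value always lands in capture group 2.

```php
<?php
// Hypothetical helper (not from the thread): collect attr => value pairs
// from a single tag string in one pass.
function parse_attributes(string $tag): array
{
    // Group 1: attribute name.
    // Group 2: value, via a branch-reset group so that the double-quoted,
    // single-quoted and unquoted alternatives all capture into group 2.
    $pattern = '@([a-zA-Z][a-zA-Z0-9._:-]*)\s*=\s*'
             . '(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+))@';

    $attrs = array();
    preg_match_all($pattern, $tag, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        // Attribute names are case-insensitive in HTML, so normalise them.
        $attrs[strtolower($m[1])] = $m[2];
    }
    return $attrs;
}

// Usage: print the three items the original question asked for.
$attrs = parse_attributes('<input type="text" name=q value=\'a b\'>');
echo "NAME: "  . ($attrs['name']  ?? '') . "\n";
echo "TYPE: "  . ($attrs['type']  ?? '') . "\n";
echo "VALUE: " . ($attrs['value'] ?? '') . "\n";
```

Because a quoted value is consumed as a whole before the scan moves on, something like `title="x=y"` is not mistaken for a second attribute, which can happen when the unquoted pass runs separately over the same tag.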
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Jul 17 '05 #2

Pedro Graca wrote:
> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)

I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
SGML; I certainly don't; I'm far too young for starters.

I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)

> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!)

<INPUT title=">">

Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
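
To make the point about ">" concrete, here is a minimal sketch that compares the quoted pattern with one that steps over quoted strings. It is an illustration only: `find_input_tags` is a made-up helper name, the pattern is my own, and it still knows nothing about comments, script content or shorthand markup, so it is not a substitute for a parser. It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper (not from the thread): return every <input ...> tag
// in $html, allowing ">" inside quoted attribute values.
function find_input_tags(string $html): array
{
    // After "<input", repeatedly consume either a complete quoted string
    // or one character that is neither a quote nor ">", then the closing ">".
    // The "*" also lets a bare "<INPUT>" match.
    $pattern = '@<input\b(?:"[^"]*"|\'[^\']*\'|[^\'">])*>@i';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}

$html = '<p><INPUT title=">"> and <input></p>';

// The pattern from the quoted code stops at the first ">" it meets and
// needs at least one character after "<input".
preg_match_all('@(<input[^>]+>)@Ui', $html, $naive);
print_r($naive[1]);

// The quote-aware variant.
print_r(find_input_tags($html));
```

On this sample the quoted pattern finds only the truncated `<INPUT title=">`, while the quote-aware variant returns both `<INPUT title=">">` and `<input>`.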

> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);

An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern

[a-zA-Z][a-zA-Z0-9._:-]*

An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",

<news:1q*******************************@40tude.net >.

> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);

Ditto.

> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);


Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.
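
As a rough sketch of how those two patterns could be used from PHP (the helper names `is_sgml_name` and `may_be_unquoted` are hypothetical, and PHP 7 or later is assumed): the hyphen is written last in the character class so that it is a literal hyphen; placed between "." and "_" it would denote a range that also lets characters such as "<" and "=" through. Requiring at least one character in the unquoted case is my assumption, on the reasoning that an empty value has to be quoted.

```php
<?php
// Hypothetical helpers (not from the thread), assuming the HTML 4.01
// name rules described above.

// A name: one name start character (a letter), then zero or more name
// characters.
function is_sgml_name(string $s): bool
{
    // "D" makes "$" match only at the very end of the subject.
    return preg_match('/^[a-zA-Z][a-zA-Z0-9._:-]*$/D', $s) === 1;
}

// Could this attribute value be written without quotes?
// Assumption: an unquoted value needs at least one name character.
function may_be_unquoted(string $value): bool
{
    return preg_match('/^[a-zA-Z0-9._:-]+$/D', $value) === 1;
}

var_dump(is_sgml_name('maxlength'));      // a plain attribute name
var_dump(is_sgml_name('1st'));            // starts with a digit
var_dump(may_be_unquoted('25'));          // digits only
var_dump(may_be_unquoted('Search RFCs')); // contains a space
```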

Phew!

Refs.:

http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

--
Jock
Jul 17 '05 #3

John Dunlop wrote:
> Pedro Graca wrote:
>> Although I've heard often enough that RXs are not the best tool for this
>> job (try a HTML or XML parser) I do very well with them myself :)
>
> I'd like to pass a few comments, nevertheless, which might change
> your mind about regular expressions for parsing (X)HTML.

Appreciate it.

> They changed my mind, anyway.

Changed my mind, too. Will take a little longer to change my scripts.
But new scripts will not use regular expressions!

> You'll understand though, hopefully, why I
> haven't offered any regular expression in place of yours (no, it's
> not because I couldn't be bothered :-)).

Same reason I'm not changing them, I guess :-)

> (Trying to cope with shorthand markup when using regexes would be a
> nightmare. Unlike proper parsers, I'm going to act like a browser
> and ignore shorthand markup, for the time being, as it'd complicate
> matters even more.)

Don't even mention that.

(snip very good content)

Thank you John. Thank you very much.
Jul 17 '05 #4

Any idea where I could find some real-life working examples of doing it the
SGML way? It is something I have never heard of before.

The reference links appear to have no relevance to what I am trying to do.

There is a PHP function, xml_parse; could this be used?
The documentation is light on that topic.

Thanks!

Jul 17 '05 #5

I (Pedro Graca) wrote:
> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!


Ufffffff. This took longer than I expected.

The XML parser included with PHP gives errors for many of the pages I
tested (most of them were HTML pages, so it's understandable :).

I found a parser for HTML I like @ http://php-html.sourceforge.net/

#v+
<?php
include 'htmlparser.inc.php'; // Yes! I changed the name
// also changed short php tag

$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

$parser = new HtmlParser($contents);
while ($parser->parse()) {
    if (strtolower($parser->iNodeName) == 'input') {

        #echo "\niNodeType: "; print_r($parser->iNodeType);
        #echo "\niNodeName: "; print_r($parser->iNodeName);
        #echo "\niNodeValue: "; print_r($parser->iNodeValue);
        echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
    }
}

echo "\n\nDone!\n";
?>
#v-

and the result of this script is:

iNodeAttributes: Array
(
[name] => query
[size] => 25
)

iNodeAttributes: Array
(
[type] => submit
[value] => Search RFCs
)

iNodeAttributes: Array
(
[name] => display
[size] => 9
)

iNodeAttributes: Array
(
[type] => submit
[value] => Display RFC By Number
)
Done!
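
For comparison, here is a sketch of the same idea using PHP's bundled DOM extension (`DOMDocument::loadHTML` plus `DOMXPath`) instead of the SourceForge parser. Nobody in this thread used it, so treat it as one possible alternative under assumptions: PHP 7 or later with the DOM extension enabled, a non-empty HTML string, and a made-up helper name, `list_form_fields`. It also picks up the select and textarea elements mentioned in the original question.

```php
<?php
// Hypothetical helper (not from the thread): list name/type/value for
// every <input>, <select> and <textarea> in an HTML string.
function list_form_fields(string $html): array
{
    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml's complaints
    // quietly instead of emitting a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath  = new DOMXPath($doc);
    $fields = array();

    foreach ($xpath->query('//input | //select | //textarea') as $node) {
        $tag = strtolower($node->nodeName);

        if ($tag === 'input') {
            // HTML 4.01 treats a missing type attribute as "text".
            $type  = $node->hasAttribute('type')
                   ? strtolower($node->getAttribute('type'))
                   : 'text';
            $value = $node->getAttribute('value');
        } elseif ($tag === 'textarea') {
            $type  = 'textarea';
            $value = $node->textContent;
        } else {
            // For <select>, report the option marked "selected", if any.
            $type     = 'select';
            $selected = $xpath->query('.//option[@selected]', $node)->item(0);
            $value    = $selected ? $selected->getAttribute('value') : '';
        }

        $fields[] = array(
            'name'  => $node->getAttribute('name'),
            'type'  => $type,
            'value' => $value,
        );
    }
    return $fields;
}

// Usage: print the fields in the layout the original question asked for.
$html = '<form><input name="query" size="25">'
      . '<input type="submit" value="Search RFCs"></form>';
foreach (list_form_fields($html) as $field) {
    echo "NAME: {$field['name']}\n";
    echo "TYPE: {$field['type']}\n";
    echo "VALUE: {$field['value']}\n\n";
}
```

The design choice here is to let the parser decide where tags and attribute values begin and end, so the quoting problems discussed earlier in the thread never reach this code.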
Jul 17 '05 #6

Hi,

Thanks, that is very helpful.
I have tried to download this file but my browser keeps crashing when I get
there.

I don't suppose if you have a copy you could email it to me?
('htmlparser.inc.php')
to: yo***@wazzup.co.nz.

YoBro!

Jul 17 '05 #7

YoBro top-posted:
> I have tried to download this file but my browser keeps crashing when I get
> there.
>
> I don't suppose if you have a copy you could email it to me?

Try here first :)
https://sourceforge.net/project/show...group_id=91649
Jul 17 '05 #8
