Help with a regular expression

Hi

I have used some of this code from the PHP manual, but I am bloody hopeless
with regular expressions.
Was hoping somebody could offer a hand.

The output of this will put the name of a form field beside name.
I want to get the following but not sure how to modify the code below.
1. Field Name (to appear beside NAME:)
2. Field Type (to appear beside TYPE:)
3. Field Value (to appear beside VALUE:)

Make sense?
It is part way there, just need some help finishing it.

$filename = "form-eg.php"; // Open file to read HTML with Form code
$fd = fopen ($filename, "rb");
$contents = fread ($fd, filesize ($filename));
preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents,
$matches); // get all input fields and attributes and values

for ($i=0; $i< count($matches[0]); $i++) {
echo "matched: ".$matches[0][$i]."<br />\n";
echo "NAME: ".$matches[1][$i]."<br />\n";
echo "TYPE: ".$matches[3][$i]."<br />\n";
echo "VALUE: ".$matches[4][$i]."<br />\n\n";
}

fclose ($fd);

I will also need to run another check for :
<select
<textarea

But I can probably figure that out from what I already have.

Thanks,

YoBro
Jul 17 '05 #1
YoBro wrote:
> I have used some of this code from the PHP manual, but I am bloody hopeless
> with regular expressions.

Although I've heard often enough that RXs are not the best tool for this
job (try a HTML or XML parser) I do very well with them myself :)

> Was hoping somebody could offer a hand.
>
> The output of this will put the name of a form field beside name.
> I want to get the following but not sure how to modify the code below.
> 1. Field Name (to appear beside NAME:)
> 2. Field Type (to appear beside TYPE:)
> 3. Field Value (to appear beside VALUE:)

But I follow a different path than you.

<?php
// initialize result data
$html_inputs = array();
$html_index = 0;

// get HTML
$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

// get all "<input ... >"s -- usually I'd group them by <form>s too
preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);

// inside each "<input ... >" isolate the pairs "attr=value"
foreach ($inputs[1] as $input) {
    // once for double quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
    // save them
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // once for single quoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // and once again for unquoted values
    preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
    foreach ($matches[0] as $k=>$dummy) {
        $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
    }

    // move on only after all three passes, so that every attribute of
    // one <input> ends up in the same entry
    ++$html_index;
}

// done, deal with them any way I like
echo '<pre>'; print_r($html_inputs); echo '</pre>';
?>
--
--= my mail box only accepts =--
--= Content-Type: text/plain =--
--= Size below 10001 bytes =--
Jul 17 '05 #2
Pedro Graca wrote:
> Although I've heard often enough that RXs are not the best tool for this
> job (try a HTML or XML parser) I do very well with them myself :)
I believe the principal reason why pre-written parsers are suggested
and recommended instead of impromptu regular expression "one-liners"
is that the gurus who've written and developed the parsers are
usually aware of and understand the rules; the "one-line" regex
implementors, on the other hand -- with all due respect -- generally
aren't and don't. I'm not going to pretend I understand everything
SGML; I certainly don't; I'm far too young for starters.

I'd like to pass a few comments, nevertheless, which might change
your mind about regular expressions for parsing (X)HTML. They
changed my mind, anyway. You'll understand though, hopefully, why I
haven't offered any regular expression in place of yours (no, it's
not because I couldn't be bothered :-)).

(Trying to cope with shorthand markup when using regexes would be a
nightmare. Unlike proper parsers, I'm going to act like a browser
and ignore shorthand markup, for the time being, as it'd complicate
matters even more.)
> // get all "<input ... >"s -- usually I'd group them by <form>s too
> preg_match_all('@(<input[^>]+>)@Ui', $contents, $inputs);
There's the standard mistake: the next occurrence of ">" does not
necessarily mark the end of the tag. In HTML, a ">" can appear in
*quoted* attribute values; it cannot appear in unquoted attribute
values. This, for example, is a valid INPUT element (I make no
claims to its logicality!)

<INPUT title=">">

Also, INPUTs have no required attributes (that is, "<INPUT>" is
valid), but the "+" quantifier matches *one* or more of whatever came
before. To over-simplistically match INPUTs, I'd substitute "*" for
"+". Since you're only wanting to match those INPUTs with explicit
type, name and value attributes, however, that's inconsequential.
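
The truncation is easy to reproduce. Below is a minimal sketch; the sample string is made up, and the quote-aware pattern is only an illustration of the idea, not something proposed in this thread. It still ignores comments, CDATA and shorthand markup, so it is no substitute for a parser.

```php
<?php
// Hypothetical sample: a valid INPUT whose quoted value contains ">".
$html = '<INPUT title=">" name="q">';

// The pattern under discussion stops at the first ">" it sees.
preg_match('@<input[^>]+>@i', $html, $naive);

// Illustrative alternative: quoted strings are consumed whole, so a ">"
// inside quotes cannot end the tag. "*" also lets a bare <INPUT> match.
$quoteAware = '@<input\b(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>@i';
preg_match($quoteAware, $html, $better);

echo $naive[0], "\n";   // <INPUT title=">
echo $better[0], "\n";  // <INPUT title=">" name="q">
```

The three alternatives inside the group start with different characters, so the engine never has to choose between them, which keeps backtracking cheap.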
> // inside each "<input ... >" isolate the pairs "attr=value"
> foreach ($inputs[1] as $input) {
> // once for double quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
An SGML name begins with a name start character and is followed by
zero or more name characters. You'd match a name, for HTML4.01, with
the pattern

[a-zA-Z][a-zA-Z0-9._:-]*

An attribute value may be of length zero, so, again, the quantifier
"*" ought to be used. And inside quoted attribute values, both "<"
and ">" can appear. Alvaro G Vicario has just pointed this out too,
in an article in the thread "php sticky forms",

<news:1q*******************************@40tude.net >.
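
To make the points about empty values and "<" or ">" inside quotes concrete, here is a rough sketch of one pattern that covers double-quoted, single-quoted and unquoted values in a single pass. The function name `parse_attributes` and the sample tag are hypothetical, and the pattern is an approximation for illustration only.

```php
<?php
// Hypothetical helper: split one start tag into a name => value map.
function parse_attributes(string $tag): array
{
    $name = '[a-zA-Z][a-zA-Z0-9._:-]*';   // hyphen last, so it is literal
    // (?| ... ) is a branch-reset group: whichever alternative matches,
    // the value is always captured as group 2. Quoted values may be empty
    // and may contain "<" and ">".
    $pattern = '@(' . $name . ')\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([a-zA-Z0-9._:-]+))@';

    preg_match_all($pattern, $tag, $pairs, PREG_SET_ORDER);

    $attrs = array();
    foreach ($pairs as $pair) {
        // HTML attribute names are case-insensitive, so normalise them.
        $attrs[strtolower($pair[1])] = $pair[2];
    }
    return $attrs;
}

print_r(parse_attributes('<input type=text NAME="q" title=\'a > b\' value="">'));
```

Because each match consumes the whole quoted value, text that merely looks like `attr=value` inside a quoted string is not picked up as a separate attribute.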
> // once for single quoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
Ditto.
> // and once again for unquoted values
> preg_match_all('@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);


Unquoted attribute values may only contain name characters. In
HTML4.01, the pattern

[a-zA-Z0-9._:-]*

matches name characters.
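
A small sketch of the name check as an anchored pattern. The helper name `is_sgml_name` is hypothetical. The hyphen is written last inside the brackets so that it stands for itself instead of forming a range with its neighbours.

```php
<?php
// Hypothetical helper: true if $s looks like an HTML 4.01 name
// (a letter, then letters, digits, ".", "_", ":" or "-").
function is_sgml_name(string $s): bool
{
    // \A and \z anchor at the very start and end of the string.
    return preg_match('/\A[a-zA-Z][a-zA-Z0-9._:-]*\z/', $s) === 1;
}

var_dump(is_sgml_name('http-equiv'));  // bool(true)
var_dump(is_sgml_name('2col'));        // bool(false): must start with a letter
var_dump(is_sgml_name('a>b'));         // bool(false): ">" is not a name character

// Placed between "." and "_", a hyphen defines the range "." through "_",
// which includes ">" but not "-" itself.
var_dump(preg_match('/\A[.-_]\z/', '>'));  // int(1)
var_dump(preg_match('/\A[.-_]\z/', '-'));  // int(0)
```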

Phew!

Refs.:

http://www.w3.org/TR/html401/sgml/sgmldecl.html
http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm

--
Jock
Jul 17 '05 #3
John Dunlop wrote:
> Pedro Graca wrote:
>> Although I've heard often enough that RXs are not the best tool for this
>> job (try a HTML or XML parser) I do very well with them myself :)
>
> I'd like to pass a few comments, nevertheless, which might change
> your mind about regular expressions for parsing (X)HTML.

Appreciate it.

> They changed my mind, anyway.

Changed my mind, too. Will take a little longer to change my scripts.
But new scripts will not use regular expressions!

> You'll understand though, hopefully, why I
> haven't offered any regular expression in place of yours (no, it's
> not because I couldn't be bothered :-)).

Same reason I'm not changing them, I guess :-)

> (Trying to cope with shorthand markup when using regexes would be a
> nightmare. Unlike proper parsers, I'm going to act like a browser
> and ignore shorthand markup, for the time being, as it'd complicate
> matters even more.)

Don't even mention that.

(snip very good content)
Thank you John. Thank you very much.
Jul 17 '05 #4
Any idea where I could find some real-life working examples of doing it the
SGML way? It is something I have never heard of before.

The reference links appear to have no relevance to what I am trying to do.

There is a PHP function, xml_parse; could this be used?
The documentation is light on that topic.

Thanks!
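
One parser-based route that tolerates ordinary HTML is PHP's DOM extension, whose `DOMDocument::loadHTML()` is built on libxml2's lenient HTML parser. Below is a minimal sketch, assuming the DOM extension is enabled. The sample form is made up, and the output simply mirrors the NAME/TYPE/VALUE labels from the first post.

```php
<?php
// Made-up form covering the three field kinds mentioned in the thread.
$html = <<<'HTML'
<form action="search.php">
  <input type="text" name="query" value="php">
  <input type=submit value="Go">
  <select name="lang"><option value="en" selected>English</option></select>
  <textarea name="notes">hello</textarea>
</form>
HTML;

$doc = new DOMDocument();
// Collect libxml warnings quietly instead of printing them for sloppy markup.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array();
foreach (array('input', 'select', 'textarea') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $el) {
        $fields[] = array(
            'tag'   => $tag,
            'name'  => $el->getAttribute('name'),   // '' when absent
            'type'  => $el->getAttribute('type'),
            // A textarea keeps its value as text content, not an attribute.
            'value' => $tag === 'textarea' ? $el->textContent : $el->getAttribute('value'),
        );
    }
}

foreach ($fields as $f) {
    echo "NAME: {$f['name']}\nTYPE: {$f['type']}\nVALUE: {$f['value']}\n\n";
}
```

A select has no value attribute of its own, so this sketch reports an empty VALUE for it; reading the selected option would take one more loop over its option children.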
Jul 17 '05 #5
I (Pedro Graca) wrote:
> Changed my mind, too. Will take a little longer to change my scripts.
> But new scripts will not use regular expressions!


Ufffffff. This took longer than I expected.

The XML parser included with PHP gives errors for many of the pages I
tested (most of them were HTML pages, so it's understandable :).
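
The failure is easy to reproduce with a fragment that is fine as HTML but not well-formed XML. Below is a minimal sketch, assuming the XML extension is available; the helper name `is_well_formed_xml` and the sample strings are made up.

```php
<?php
// Hypothetical helper: true if the XML parser accepts $markup as a
// complete, well-formed document.
function is_well_formed_xml(string $markup): bool
{
    $parser = xml_parser_create();
    // The third argument marks this as the final chunk of input.
    return xml_parse($parser, $markup, true) === 1;
}

// Valid HTML 4.01: unquoted attribute values, no end tag for INPUT.
var_dump(is_well_formed_xml('<form><input type=text name=q></form>'));       // bool(false)
// The same content written as well-formed XML.
var_dump(is_well_formed_xml('<form><input type="text" name="q"/></form>'));  // bool(true)
```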

I found a parser for HTML I like @ http://php-html.sourceforge.net/

#v+
<?php
include 'htmlparser.inc.php'; // Yes! I changed the name
                              // also changed short php tag

$contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

$parser = new HtmlParser($contents);
while ($parser->parse()) {
    if (strtolower($parser->iNodeName) == 'input') {

        #echo "\niNodeType: "; print_r($parser->iNodeType);
        #echo "\niNodeName: "; print_r($parser->iNodeName);
        #echo "\niNodeValue: "; print_r($parser->iNodeValue);
        echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
    }
}

echo "\n\nDone!\n";
?>
#v-

and the result of this script is:

iNodeAttributes: Array
(
[name] => query
[size] => 25
)

iNodeAttributes: Array
(
[type] => submit
[value] => Search RFCs
)

iNodeAttributes: Array
(
[name] => display
[size] => 9
)

iNodeAttributes: Array
(
[type] => submit
[value] => Display RFC By Number
)
Done!
Jul 17 '05 #6
Hi,

Thanks, that is very helpful.
I have tried to download this file, but my browser keeps crashing when I get
there.

If you have a copy of 'htmlparser.inc.php', I don't suppose you could email
it to me at yo***@wazzup.co.nz?

YoBro!
Jul 17 '05 #7
YoBro top-posted:
> I have tried to download this file but my browser keeps crashing when I get
> there.
>
> I don't suppose if you have a copy you could email it to me?


Try here first :)
https://sourceforge.net/project/show...group_id=91649
Jul 17 '05 #8
