Bytes IT Community

in-line detection of html escape codes

say i have a for loop that would iterate through every character and
put a space between every 80th one, in effect forcing word wrap to
occur. this can be implemented easily using a regular expression.
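
For illustration, the regular-expression version mentioned here might look like the sketch below. The function name force_wrap and the $width parameter are made up for this example; it simply breaks after every run of $width characters and takes no account of urls or entities.

```php
<?php
// Insert a space after every run of $width characters.
// The lookahead (?=.) avoids adding a trailing space at the very end.
function force_wrap(string $text, int $width = 80): string
{
    return preg_replace('/(.{' . $width . '})(?=.)/s', '$1 ', $text);
}

echo force_wrap('abcdefgh', 3), "\n";
```

A small width is used in the call only to keep the sample input short.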

if i wanted to improve on this, and make it so stuff in url's didn't
count towards that 80 character limit, a regular expression would not
suffice. however, a simple for loop does.

so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

i was thinking i could sorta simulate a finite state machine that
returns the current state when the function is called. the current
state would then be repassed into the finite state machine along with
the next character in the string, and the new state could be returned.
if the state returned is an accept state, we only count all the
characters in the string of characters that was passed to the FSM
once, and if no state is returned, we count all the characters towards
the 80 character limit.

however, i'm not really sure how to implement the above function. one
problem is that there seem to be a lot of html escape codes, and...
yeah...

any help would be appreciated - thanks! :)
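
A rough sketch of the state-machine idea described in this post, assuming only the shape &name; matters (whether the name is a real entity would need a separate check). The state constants and the functions fsm_step and visible_length are hypothetical names, not part of PHP.

```php
<?php
// Hypothetical state names for the sketch.
const S_TEXT   = 0; // outside any candidate entity
const S_AMP    = 1; // just saw '&'
const S_NAME   = 2; // inside the entity name
const S_ACCEPT = 3; // saw the closing ';'

// One transition: current state + next character -> new state.
// Returns null when the buffered characters cannot be an entity.
function fsm_step(int $state, string $ch): ?int
{
    switch ($state) {
        case S_TEXT:
            return $ch === '&' ? S_AMP : S_TEXT;
        case S_AMP:
            return ctype_alpha($ch) ? S_NAME : null;
        case S_NAME:
            if ($ch === ';') {
                return S_ACCEPT;
            }
            return ctype_alnum($ch) ? S_NAME : null;
    }
    return null;
}

// Count characters, treating a syntactically complete "&name;" as one.
function visible_length(string $s): int
{
    $count = 0;
    $pending = 0;        // characters buffered since the last '&'
    $state = S_TEXT;
    $n = strlen($s);
    for ($i = 0; $i < $n; $i++) {
        $next = fsm_step($state, $s[$i]);
        if ($next === null) {
            // Not an entity after all: the buffered characters count in
            // full, and the current character is re-read as plain text.
            $count += $pending;
            $pending = 0;
            $state = S_TEXT;
            $i--;
            continue;
        }
        if ($next === S_TEXT) {
            $count++;
        } elseif ($next === S_ACCEPT) {
            $count++;            // the whole "&name;" counts once
            $pending = 0;
            $next = S_TEXT;
        } else {
            $pending++;
        }
        $state = $next;
    }
    return $count + $pending;    // an unfinished "&na" counts in full
}

echo visible_length('a&amp;b'), "\n";
```

Each character is looked at a bounded number of times, so the loop stays linear in the length of the string.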
Jul 17 '05 #1
11 Replies


On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:
so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.


One word: regular expressions.

--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #2

On Thu, 03 Jun 2004 00:09:14 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:
On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:
so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.


One word: regular expressions.


because half of what i'm trying to do *can't* be done using regular
expressions (you can verify this for yourself by using the pumping
lemma on it), why would i want to do the other half in regular
expressions? i want my code to have a big-o efficiency as close to
O(n) as possible - not O(n**3), or whatever.

also, what exact regular expression would you propose? &[^&;]*; isn't
a good one because not just any string of characters between a & and ;
can make an html escape code - only certain ones can. an example of
one that isn't is &asdf;

i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.

additionally, i don't know what every single html escape code is.

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states, etc.
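
One possible way around not knowing every entity name, sketched under the assumption that PHP's own HTML 4.01 entity table is good enough: get_html_translation_table() returns the characters and their named entities, and flipping it gives a lookup keyed by entity. The helper name is_known_entity is hypothetical.

```php
<?php
// Map "character => entity" as PHP knows it, flipped to "entity => character".
$known = array_flip(
    get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8')
);

// Hypothetical helper: true when $candidate (e.g. "&nbsp;") is in the table.
function is_known_entity(string $candidate, array $known): bool
{
    return isset($known[$candidate]);
}

var_dump(is_known_entity('&nbsp;', $known));
var_dump(is_known_entity('&asdf;', $known));
```

An array lookup like this avoids a long hand-written alternation, and each check is a single hash probe, so it should not push the loop beyond roughly linear time.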
Jul 17 '05 #3

Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:
<snip>

This isn't tested code (look it over, it's a bit late, locally), but I
think if you html_entity_decode() anything that looks like an HTML entity,
and the result is only one character, then you can safely assume it's a
valid HTML entity.

Ref: http://us3.php.net/manual/en/functio...ity-decode.php

<?php
$instring = '&amp; will encode. &bogus; will not.';
$outstring = '';
$charcount = 0;

for ($i = 0; $i < strlen($instring); /* I'll increment $i myself */ ) {
    // If it IS something that looks like a character class...
    if (preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {
        // Isolate it
        $testcase = $matchybits[0];

        // If it decodes down to one character...
        if (strlen(html_entity_decode($testcase)) == 1) {

            // increment the charcount variable by one,
            // increment the index pointer past the element
            // and spit the raw HTML entity out to the output

            $charcount++;
            $i += strlen($testcase);
            $outstring .= $testcase;

        }
        // If it doesn't look like a character class, just move along...
        else {
            $i++;
            $charcount++;
            $outstring .= $instring[$i];
        }
    }

    if ($charcount == 80) {
        $outstring .= " ";
        // All that, just to add a space...
    }
}

// $instring is unchanged
// $outstring is your output,
// $charcount is the length, without new spaces, of the string
// $i and $matchybits are junk

?>

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #4

Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:
// If it doesn't look like a character class, just move along...
else {
    $i++;
    $charcount++;
    $outstring .= $instring[$i];
}


Small correction:

// If it doesn't look like a character class, just move along...
else {
    $outstring .= $instring[$i];
    $i++;
    $charcount++;
}

I had to move the concatenation line up, before I incremented $i, or all
hell would break loose.

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #5

Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:
if (preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {


....AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if (preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {
--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #6

On Thu, 3 Jun 2004 02:29:06 -0400, FLEB
<so*********@mmers.and.evil.ones.will.bow-down-to.us> wrote:
Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:
<snip>
This isn't tested code (look it over, it's a bit late, locally), but I
think if you html_entity_decode() anything that looks like an HTML entity,
and the result is only one character, then you can safely assume it's a
valid HTML entity.

Ref: http://us3.php.net/manual/en/functio...ity-decode.php

<snip>


i wasn't aware of the html_entity_decode function - thanks for
introducing me to that, and for the code segment! :)
Jul 17 '05 #7

On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:
i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.

I had in mind something like &[a-z]+;

additionally, i don't know what every single html escape code is.

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states,


The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.
--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #8

FLEB wrote:
...AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if (preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {


That pattern matches "& lt ;", but not "&lt". The former is clearly
*not* an entity reference, whereas the latter is.

From <http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm>, I understand
this PCRE matches entity references in HTML4.01 (untested):

`&[a-z][a-z0-9.:_-]*[\r;]?`i

An entity reference, in HTML, begins with entity reference open ("&"),
followed by a letter and zero or more name characters, and ends either
(a) implicitly, with the first non-name character, or (b) explicitly,
with a record end (carriage return) or reference end (";").
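
A quick way to compare the two patterns on the strings mentioned above, as a sketch only. The helper first_match is a hypothetical name; the patterns are the ones quoted in this thread.

```php
<?php
// FLEB's corrected rough test and the stricter HTML 4.01 pattern from this post.
$rough  = '/^&[^;]+;/';
$strict = '`&[a-z][a-z0-9.:_-]*[\r;]?`i';

// Hypothetical helper: return the matched text, or null when nothing matches.
function first_match(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[0] : null;
}

foreach (['& lt ;', '&lt', '&lt;'] as $sample) {
    printf(
        "%-8s rough=%s strict=%s\n",
        '"' . $sample . '"',
        var_export(first_match($rough, $sample), true),
        var_export(first_match($strict, $sample), true)
    );
}
```

Neither pattern says whether the name is a defined entity; both only test the syntax.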

That's not all though, for you must know *when* to parse for entity
references. Character sequences matching the syntax of entity
references may not actually *be* entity references. It's a mistake,
for example, to replace "&lt;" in a comment with "<" -- what the
entity reference "&lt;" refers to: a character reference representing
"<".

As to the original point of discussion, why are spaces being
introduced into HTML? And why are entity references being
dereferenced?

--
Jock
Jul 17 '05 #9

Regarding this well-known quote, often attributed to John Dunlop's famous
"Thu, 3 Jun 2004 18:01:38 +0100" speech:
<snip>


The regexp is just a rough test to pick out anything that remotely looks
like an entity reference. If it passes the rough test, the code attempts to
de-entity the matched text (&[^;]+;), and if it succeeds in de-entitying (if
the result is one character long), the text was obviously a valid entity.
The entire matched portion is then read as one character, for the purpose
of counting eighty characters. A stricter regexp, /^&[a-zA-Z]+;/, would
have worked, true, but mine will work just as well.

If the de-entity fails (returns multiple characters), then the program just
counts the ampersand and goes on, just like any other character. This way,
something like "& blah, blah, &amp; blah! ;" will count the first & as a
normal character, since trying to de-entity it returns more than one
character, and move on. After the first ampersand is eaten, it will later
regex match on &amp;, that will convert to one character, "&", and the
program will thus count it as one.

I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML? AFAIK, you should use
&amp;. I might be totally wrong, though.

Comments (knowing WHEN to parse) are something I hadn't really taken into
account. Good call. For this person's uses, I suppose they should skip over
anything within a <!-- --> block and call it zero chars, since it won't add
to the display size.
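
The behaviour described in this post could be pulled together roughly as follows. This is a sketch, not the code posted earlier in the thread: the function name wrap_counting_entities is hypothetical, it uses a stricter rough test than /^&[^;]+;/, and it swaps the "decodes to one character" check for "decoding changed the text", since strlen() counts bytes and a UTF-8 decode of &copy; is two bytes long. Comments (<!-- -->) are not skipped here.

```php
<?php
// Insert a space after every $width counted characters, where a valid
// named entity such as "&amp;" counts as a single character.
function wrap_counting_entities(string $in, int $width = 80): string
{
    $out = '';
    $count = 0;
    $len = strlen($in);

    for ($i = 0; $i < $len; /* advanced inside the loop */) {
        $chunk = $in[$i]; // default: one ordinary character

        // Rough test: does an "&name;" shape start at this position?
        if ($in[$i] === '&'
            && preg_match('/\G&[a-zA-Z][a-zA-Z0-9]*;/', $in, $m, 0, $i)) {
            // Known entities decode to something different; unknown ones
            // such as "&bogus;" come back unchanged.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if ($decoded !== $m[0]) {
                $chunk = $m[0];
            }
        }

        $out .= $chunk;
        $i += strlen($chunk);
        $count++;

        // Break after every $width-th counted character, but not at the end.
        if ($count % $width === 0 && $i < $len) {
            $out .= ' ';
        }
    }
    return $out;
}

echo wrap_counting_entities('ab&amp;cd', 3), "\n";
```

Using the modulo test rather than == 80 means the space is added after every 80th counted character, not only the first time the counter reaches 80.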

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #10

On Thu, 03 Jun 2004 06:01:57 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:
On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:
<snip>
The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.


i hadn't heard of that - thanks! :)
Jul 17 '05 #11

FLEB wrote:

[ ... ]
I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML?

Yes and no. ;o)

In XML, except in CDATA sections (XML1.0 sec. 2.7), ampersands cannot
appear in their literal form; in HTML, however, unless an ampersand
begins an entity reference or forms part of the beginning of a
character reference, it's not markup.

AFAIK, you should use &amp;.


That's what the HTML spec recommends too.
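
In PHP, one way to follow that recommendation is htmlspecialchars(), shown here as a small sketch with made-up sample strings. The fourth argument (double_encode) set to false leaves entities that are already present alone.

```php
<?php
// Escape a literal ampersand before emitting text into HTML.
$raw = 'Fish & chips';
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

// With double_encode = false, an existing "&amp;" is not escaped again.
$mixed = 'AT&T &amp; co';
$escapedOnce = htmlspecialchars($mixed, ENT_QUOTES, 'UTF-8', false);

echo $escaped, "\n";     // Fish &amp; chips
echo $escapedOnce, "\n"; // AT&amp;T &amp; co
```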

Have a good weekend!

--
Jock
Jul 17 '05 #12
