in-line detection of html escape codes

say i have a for loop that would iterate through every character and
put a space after every 80th one, in effect forcing word wrap to
occur. this can be implemented easily using a regular expression.
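
for reference, a minimal sketch of that regular expression version, assuming plain single-byte text with no markup in it. the function name force_wrap and the $width parameter are made up for this example:

```php
<?php
// Insert a space after every run of $width characters.
// The /s modifier lets "." match newlines too, so every byte is counted.
function force_wrap($text, $width = 80)
{
    return preg_replace('/(.{' . (int) $width . '})/s', '$1 ', $text);
}
```

calling force_wrap($text) uses the 80 character default; a smaller width is handy for trying it out on short strings.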

if i wanted to improve on this, and make it so stuff in urls didn't
count towards that 80 character limit, a regular expression would not
suffice. however, a simple for loop does.

so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

i was thinking i could sorta simulate a finite state machine that
returns the current state when the function is called. the current
state would then be passed back into the finite state machine along with
the next character in the string, and the new state would be returned.
if the state returned is an accept state, we count all the
characters in the string of characters that was passed to the FSM
only once, and if no state is returned, we count all the characters towards
the 80 character limit.
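
roughly what i have in mind, as a rough sketch. the state names and both function names are made up, and it only checks the *shape* of an escape code (&, then letters or digits, then ;), not whether the name is a real one, so &asdf; would still be accepted:

```php
<?php
// Hypothetical states: 'start' (nothing seen yet), 'amp' (just saw "&"),
// 'name' (inside the name), 'accept' (closing ";" seen).
// null means "this is not an escape code".
function entity_step($state, $char)
{
    if ($state === 'start') {
        return $char === '&' ? 'amp' : null;
    }
    if ($state === 'amp' || $state === 'name') {
        if ($char === ';') {
            // "&;" is rejected: at least one name character is required.
            return $state === 'name' ? 'accept' : null;
        }
        // Names are assumed to consist of ASCII letters and digits only.
        return ctype_alnum($char) ? 'name' : null;
    }
    return null; // nothing may follow the accept state
}

// Feed the characters starting at $offset into the machine and return the
// length of the escape code found there, or 0 if there is none.
function entity_length($text, $offset)
{
    $state = 'start';
    $length = strlen($text);
    for ($i = $offset; $i < $length; $i++) {
        $state = entity_step($state, $text[$i]);
        if ($state === null) {
            return 0;
        }
        if ($state === 'accept') {
            return $i - $offset + 1;
        }
    }
    return 0; // ran out of input before reaching ";"
}
```

the main loop would call entity_length() at each position and, when it returns something other than 0, copy that many characters while bumping the 80 character count by only one.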

however, i'm not really sure how to implement the above function. one
problem is that there seem to be a lot of html escape codes, and...
yeah...

any help would be appreciated - thanks! :)
Jul 17 '05 #1
On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:
so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.


One word: regular expressions.

--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #2
On Thu, 03 Jun 2004 00:09:14 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:
On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:
so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.


One word: regular expressions.


because half of what i'm trying to do *can't* be done using regular
expressions (you can verify this for yourself by using the pumping
lemma on it), why would i want to do the other half in regular
expressions? i want my code to have a big-o efficiency as close to
O(n) as possible - not O(n**3), or whatever.

also, what exact regular expression would you propose? &[^&;]*; isn't
a good one because not just any string of characters between a & and ;
can make an html escape code - only certain ones can. an example of
one that isn't is &asdf;

i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.

additionally, i don't know what every single html escape code is.
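
one possible way around that, as a sketch: php can hand back its own table of named escape codes through get_html_translation_table(), so a lookup could replace a hand-written list. the helper names below are made up, and the HTML 4.01 / UTF-8 arguments are an assumption about what the page uses:

```php
<?php
// Build a lookup of every named escape code PHP knows for HTML 4.01.
// get_html_translation_table() maps characters to codes ("&" => "&amp;"),
// so the table is flipped to make the codes the keys.
function known_entities()
{
    static $entities = null; // built once, reused on later calls
    if ($entities === null) {
        $table = get_html_translation_table(
            HTML_ENTITIES,
            ENT_QUOTES | ENT_HTML401,
            'UTF-8'
        );
        $entities = array_flip($table);
    }
    return $entities;
}

// True if $candidate (for example "&nbsp;") is one of the known codes.
function is_known_entity($candidate)
{
    return array_key_exists($candidate, known_entities());
}
```

each check is a single array lookup, so it shouldn't hurt the O(n) goal much.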

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states, etc.
Jul 17 '05 #3
Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:
<snip>

This isn't tested code (look it over, it's a bit late, locally), but I
think if you html_entity_decode() anything that looks like an HTML entity,
and the result is only one character, then you can safely assume it's a
valid HTML entity.

Ref: http://us3.php.net/manual/en/functio...ity-decode.php

<?php
$instring = '&amp; will encode. &bogus; will not.';
$outstring = '';
$charcount = 0;

for ($i = 0; $i < strlen($instring); /* I'll increment $i myself */ ) {
    // If it IS something that looks like a character entity, and it
    // decodes down to one character (ISO-8859-1 is given explicitly so
    // that one character is always one byte for strlen)...
    if (preg_match('/^&[^;];/', substr($instring, $i), $matchybits)
        && strlen(html_entity_decode($matchybits[0], ENT_QUOTES, 'ISO-8859-1')) == 1) {
        // Isolate it
        $testcase = $matchybits[0];

        // increment the charcount variable by one,
        // increment the index pointer past the entity
        // and spit the raw HTML entity out to the output
        $charcount++;
        $i += strlen($testcase);
        $outstring .= $testcase;
    }
    // If it doesn't look like a character entity, just move along...
    else {
        $i++;
        $charcount++;
        $outstring .= $instring[$i];
    }

    // Every 80th counted character gets a space after it
    if ($charcount % 80 == 0) {
        $outstring .= " ";
        // All that, just to add a space...
    }
}

// $instring is unchanged
// $outstring is your output,
// $charcount is the length, without new spaces, of the string
// $i, $matchybits and $testcase are junk

?>

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #4
Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:
// If it doesn't look like a character entity, just move along...
else {
    $i++;
    $charcount++;
    $outstring .= $instring[$i];
}


Small correction:

// If it doesn't look like a character entity, just move along...
else {
    $outstring .= $instring[$i];
    $i++;
    $charcount++;
}

I had to move the concatenation line up, before I incremented $i, or all
hell would break loose.

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #5
Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:
if (preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {


...AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if (preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {
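
For anyone following along, here is a cleaned-up sketch of the whole loop with the corrected pattern and the corrected ordering in the else branch. Treat it as a sketch under assumptions rather than the final word: the function name wrap_counting_entities and the $width parameter are made up for the example, and it assumes the text is UTF-8, which is why it checks for "one character" with a /u pattern instead of strlen().

```php
<?php
// Hypothetical helper: insert a space after every $width counted characters,
// counting a valid entity such as &amp; as a single character.
function wrap_counting_entities($instring, $width = 80)
{
    $outstring = '';
    $charcount = 0;
    $length = strlen($instring);

    for ($i = 0; $i < $length; /* $i is advanced inside the loop */) {
        $chunk = $instring[$i]; // default: consume one plain character

        // Rough test: does something entity-shaped start at offset $i?
        if (preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {
            $decoded = html_entity_decode($matchybits[0], ENT_QUOTES, 'UTF-8');
            // A real entity decodes to exactly one (UTF-8) character.
            if (preg_match('/^.\z/us', $decoded) === 1) {
                $chunk = $matchybits[0]; // consume the whole entity
            }
        }

        $outstring .= $chunk;
        $i += strlen($chunk);
        $charcount++;

        if ($charcount % $width == 0) {
            $outstring .= ' ';
        }
    }

    return $outstring;
}
```

Calling substr() on every pass copies the rest of the string each time, so this is simple rather than fast; it is only meant to show the counting logic.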
--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #6
On Thu, 3 Jun 2004 02:29:06 -0400, FLEB
<so*********@mmers.and.evil.ones.will.bow-down-to.us> wrote:
Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:
<snip>

i wasn't aware of the html_entity_decode function - thanks for
introducing me to that, and for the code segment! :)
Jul 17 '05 #7
On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:
i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.


I had in mind something like &[a-z]+;

additionally, i don't know what every single html escape code is.

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states,


The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.
--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #8
FLEB wrote:
...AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if (preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {


That pattern matches "& lt ;", but not "&lt". The former is clearly
*not* an entity reference, whereas the latter is.

From <http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm>, I understand
this PCRE matches entity references in HTML4.01 (untested):

`&[a-z][a-z0-9.:_-]*[\r;]?`i

An entity reference, in HTML, begins with entity reference open ("&"),
followed by a letter and zero or more name characters, and ends either
(a) implicitly, with the first non-name character, or (b) explicitly,
with a record end (carriage return) or reference end (";").
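
A minimal sketch of trying that pattern out in PHP, assuming the PCRE functions are available. The function name find_entity_references is purely illustrative, and the pattern is used exactly as written above, so this checks syntax only and says nothing about whether the named entity exists:

```php
<?php
// Return every substring of $html that has the syntax of an entity
// reference, using backtick delimiters and the case-insensitive modifier.
function find_entity_references($html)
{
    // Single quotes keep "\r" intact so that PCRE sees it as a carriage return.
    $pattern = '`&[a-z][a-z0-9.:_-]*[\r;]?`i';
    preg_match_all($pattern, $html, $matches);
    return $matches[0];
}
```

Given the input 'a &lt b & lt ; &amp;', this should pick out "&lt" and "&amp;" and skip "& lt ;", because the ampersand there is not followed by a letter.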

That's not all though, for you must know *when* to parse for entity
references. Character sequences matching the syntax of entity
references may not actually *be* entity references. It's a mistake,
for example, to replace "&lt;" in a comment with "<" -- what the
entity reference "&lt;" refers to: a character reference representing
"<".

As to the original point of discussion, why are spaces being
introduced into HTML? And why are entity references being
dereferenced?

--
Jock
Jul 17 '05 #9
Regarding this well-known quote, often attributed to John Dunlop's famous
"Thu, 3 Jun 2004 18:01:38 +0100" speech:
<snip>


The regexp is just a rough test to pick out anything that remotely looks
like an entity reference. If it passes the rough test, the code attempts to
de-entity the matched text (&[^;]+;), and if it succeeds in de-entitying (if
the result is one character long), the text was obviously a valid entity.
The entire matched portion is then read as one character, for the purpose
of counting eighty characters. A stricter regexp, /^&[a-zA-Z]+;/, would
have worked, true, but mine will work just as well.

If the de-entity fails (returns multiple characters), then the program just
counts the ampersand and goes on, just like any other character. This way,
something like "& blah, blah, &amp; blah! ;" will count the first & as a
normal character, since trying to de-entity it returns more than one
character, and move on. After the first ampersand is eaten, it will later
regex match on &amp;, that will convert to one character, "&", and the
program will thus count it as one.

I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML? AFAIK, you should use
&amp;. I might be totally wrong, though.

Comments (knowing WHEN to parse) are something I hadn't really taken into
account. Good call. For this person's uses, I suppose they should skip over
anything within a <!-- --> block and call it zero chars, since it won't add
to the display size.
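
A rough sketch of that comment-skipping step, assuming comments always open with "<!--" and close at the next "-->". The helper name comment_length is made up for the example:

```php
<?php
// Hypothetical helper: if an HTML comment starts at $offset, return its
// length in bytes (including the closing "-->"); otherwise return 0.
function comment_length($text, $offset)
{
    if (substr($text, $offset, 4) !== '<!--') {
        return 0;
    }
    $end = strpos($text, '-->', $offset + 4);
    if ($end === false) {
        // Unterminated comment: assume it runs to the end of the input.
        return strlen($text) - $offset;
    }
    return $end + 3 - $offset;
}
```

In the main loop, a non-zero result would mean copying that many bytes to the output and advancing the index past them without touching the character count.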

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com
Jul 17 '05 #10
On Thu, 03 Jun 2004 06:01:57 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:
On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:
<snip>
The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.


i hadn't heard of that - thanks! :)
Jul 17 '05 #11
FLEB wrote:

[ ... ]
I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML?


Yes and no. ;o)

In XML, except in CDATA sections (XML1.0 sec. 2.7), ampersands cannot
appear in their literal form; in HTML, however, unless an ampersand
begins an entity reference or forms part of the beginning of a
character reference, it's not markup.

AFAIK, you should use &amp;.


That's what the HTML spec recommends too.

Have a good weekend!

--
Jock
Jul 17 '05 #12
