in-line detection of html escape codes

yawnmoth

say i have a for loop that would iterate through every character and
put a space between every 80th one, in effect forcing word wrap to
occur. this can be implemented easily using a regular expression.
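
For the simple case, a minimal sketch of the regular-expression approach might look like the following. The sample string and variable names are made up for illustration, and the pattern counts every byte, so it knows nothing about URLs or escape codes.

```php
<?php
// Hypothetical sample input: 200 characters with no natural break points.
$text = str_repeat('x', 200);

// Insert a space after every run of 80 characters.
// The /s flag lets "." match newlines as well.
$wrapped = preg_replace('/(.{80})/s', '$1 ', $text);
```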

if i wanted to improve on this, and make it so stuff in url's didn't
count towards that 80 character limit, a regular expression would not
suffice. however, a simple for loop does.

so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

i was thinking i could sorta simulate a finite state machine that
returns the current state when the function is called. the current
state would then be repassed into the finite state machine along with
the next character in the string, and the new state could be returned.
if the state returned is an accept state, we only count all the
characters in the string of characters that was passed to the FSM
once, and if no state is returned, we count all the characters towards
the 80 character limit.

however, i'm not really sure how to implement the above function. one
problem is that there seem to be a lot of html escape codes, and...
yeah...

any help would be appreciated - thanks! :)

Jul 17 '05 #1

Mladen Gogala

On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:

so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

One word: regular expressions.

--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #2

yawnmoth

On Thu, 03 Jun 2004 00:09:14 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:

On Wed, 02 Jun 2004 21:05:08 -0700, yawnmoth wrote:
so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

One word: regular expressions.

because half of what i'm trying to do *can't* be done using regular
expressions (you can verify this for yourself by using the pumping
lemma on it), why would i want to do the other half in regular
expressions? i want my code to have a big-o efficiency as close to
O(n) as possible - not O(n**3), or whatever.

also, what exact regular expression would you propose? &[^&;]*; isn't
a good one because not just any string of characters between a & and ;
can make an html escape code - only certain ones can. an example of
one that isn't is &asdf;

i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.

additionally, i don't know what every single html escape code is.
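
One way around not knowing the full list, sketched here under the assumption that PHP's built-in HTML 4.01 entity table is good enough, is to ask PHP for it. `get_html_translation_table()` is a real PHP function; the names `$known` and `isKnownEntity()` are made up for this example.

```php
<?php
// Build a lookup set of the named entities PHP knows for HTML 4.01.
// The table maps characters to entities, so flip it to key on the entity.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_HTML401, 'UTF-8');
$known = array_flip($table);

// Hypothetical helper: true only for entities present in the table.
function isKnownEntity(array $known, string $candidate): bool
{
    return isset($known[$candidate]);
}
```

A hash lookup like this is a constant-time check per candidate, so it does not change the overall O(n) behaviour of a single pass over the string.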

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states, etc.
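
A rough sketch of that idea, assuming the state machine is stored as a trie of nested arrays: each state is a node, and feeding it one character returns the next node or null. All function names and the three-entity list are hypothetical, chosen only to keep the example short.

```php
<?php
// Build a trie from entity names; each node is an array keyed by character.
function buildTrie(array $names): array
{
    $root = [];
    foreach ($names as $name) {
        $node = &$root;
        foreach (str_split('&' . $name . ';') as $ch) {
            if (!isset($node[$ch])) {
                $node[$ch] = [];
            }
            $node = &$node[$ch];
        }
        $node['accept'] = true; // marks an accept state
        unset($node);           // break the reference before the next name
    }
    return $root;
}

// One transition: current state plus one character gives the new state.
// null means there is no such transition.
function step(?array $state, string $ch): ?array
{
    return $state[$ch] ?? null;
}

// Length of the entity starting at offset $i, or 0 if there is none.
function entityLengthAt(array $trie, string $s, int $i): int
{
    $state = $trie;
    $len = strlen($s);
    for ($j = $i; $j < $len; $j++) {
        $state = step($state, $s[$j]);
        if ($state === null) {
            return 0;
        }
        if (isset($state['accept'])) {
            return $j - $i + 1;
        }
    }
    return 0;
}

$trie = buildTrie(['nbsp', 'amp', 'copy']);
```

Because a failed attempt stops at the first character with no transition, and no attempt can run longer than the longest entity name, the total work over the whole string stays linear in its length.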

Jul 17 '05 #3

FLEB

Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:

say i have a for loop that would iterate through every character and
put a space between every 80th one, in effect forcing word wrap to
occur. this can be implemented easily using a regular expression.

if i wanted to improve on this, and make it so stuff in url's didn't
count towards that 80 character limit, a regular expression would not
suffice. however, a simple for loop does.

so now i'm curious how to account for html escape codes such as
&nbsp; and &copy;. since i have a for-loop, in-line detection seems
to be the way to go, although i'm not really sure how to implement it.

i was thinking i could sorta simulate a finite state machine that
returns the current state when the function is called. the current
state would then be repassed into the finite state machine along with
the next character in the string, and the new state could be returned.
if the state returned is an accept state, we only count all the
characters in the string of characters that was passed to the FSM
once, and if no state is returned, we count all the characters towards
the 80 character limit.

however, i'm not really sure how to implement the above function. one
problem is that there seem to be a lot of html escape codes, and...
yeah...

any help would be appreciated - thanks! :)

This isn't tested code (look it over, it's a bit late, locally), but I
think if you html_entity_decode() anything that looks like an HTML entity,
and the result is only one character, then you can safely assume it's a
valid HTML entity.

Ref: http://us3.php.net/manual/en/functio...ity-decode.php

<?php
$instring = '&amp; will encode. &bogus; will not.';
$outstring = '';
$charcount = 0;

for ($i=0; $i<strlen($instring); /* I'll increment $i myself */ ) {
    // If it IS something that looks like a character class...
    if ($preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {
        // Isolate it
        $testcase = $matchybits[0];

        // If it decodes down to one character...
        if (strlen(html_entity_decode($testcase)) == 1) {

            // increment the charcount variable by one,
            // increment the index pointer past the element
            // and spit the raw HTML entity out to the output

            $charcount++;
            $i += strlen($matchybits);
            $outstring .= $matchybits;

        }
        // If it doesn't look like a character class, just move along...
        else {
            $i++;
            $charcount++;
            $outstring .= $instring{$i};
        }
    }

    if ($charcount == 80) {
        $outstring .= " ";
        // All that, just to add a space...
    }
}

// $instring is unchanged
// $outstring is your output,
// $charcount is the length, without new spaces, of the string
// $i and $matchybits are junk

?>

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com

Jul 17 '05 #4

FLEB

Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:

// If it doesn't look like a character class, just move along...
else {
$i++;
$charcount++;
$outstring .= $instring{$i};
}

Small correction:

// If it doesn't look like a character class, just move along...
else {
$outstring .= $instring{$i};
$i++;
$charcount++;
}

I had to move the concatenation line up, before I incremented $i, or all
hell would break loose.

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com

Jul 17 '05 #5

FLEB

Regarding this well-known quote, often attributed to FLEB's famous "Thu, 3
Jun 2004 02:29:06 -0400" speech:

if ($preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {

....AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if ($preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {
--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com

Jul 17 '05 #6

yawnmoth

On Thu, 3 Jun 2004 02:29:06 -0400, FLEB
<so*********@mmers.and.evil.ones.will.bow-down-to.us> wrote:

Regarding this well-known quote, often attributed to yawnmoth's famous "2
Jun 2004 21:05:08 -0700" speech:
<snip>
This isn't tested code (look it over, it's a bit late, locally), but I
think if you html_entity_decode() anything that looks like an HTML entity,
and the result is only one character, then you can safely assume it's a
valid HTML entity.

Ref: http://us3.php.net/manual/en/functio...ity-decode.php

<?php
$instring = '&amp; will encode. &bogus; will not.';
$outstring = '';
$charcount = 0;

for ($i=0; $i<strlen($instring); /* I'll increment $i myself */ ) {
// If it IS something that looks like a character class...
if ($preg_match('/^&[^;];/', substr($instring, $i), $matchybits)) {
// Isolate it
$testcase = $matchybits[0];

// If it decodes down to one character...
if (strlen(html_entity_decode($testcase)) == 1) {

// increment the charcount variable by one,
// increment the index pointer past the element
// and spit the raw HTML entity out to the output

$charcount++;
$i += strlen($matchybits);
$outstring .= $matchybits;

}
// If it doesn't look like a character class, just move along...
else {
$i++;
$charcount++;
$outstring .= $instring{$i};
}
}

if ($charcount == 80) {
$outstring .= " ";
// All that, just to add a space...
}
}

// $instring is unchanged
// $outstring is your output,
// $charcount is the length, without new spaces, of the string
// $i and $matchybits are junk

?>

i wasn't aware of the html_entity_decode function - thanks for
introducing me to that, and for the code segment! :)

Jul 17 '05 #7

Mladen Gogala

On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:

i suppose i could do something like &(nbsp|amp|gt|lt| etc ); or
&((n(bsp|tilde))|amp);, but... the former isn't going to be uber fast
(especially since i already have to loop through the string, anyway),
and... the latter is going to be *very* hard to write, having tons of
parentheses, being very long, etc.

I had in mind something like &[a-z]+;

additionally, i don't know what every single html escape code is.

anyway, as i said before, i think the way to go is to use some
implementation of a finite state machine that returns the current
state for each one character input. regular expressions are
unsuitable for this task because they don't return states,

The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.
--
Trust me, I know what I'm doing. (Sledge Hammer)

Jul 17 '05 #8

John Dunlop

FLEB wrote:

...AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if ($preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {

That pattern matches "& lt ;", but not "&lt". The former is clearly
*not* an entity reference, whereas the latter is.

From <http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm>, I understand
this PCRE matches entity references in HTML4.01 (untested):

`&[a-z][a-z0-9.:_-]*[\r;]?`i

An entity reference, in HTML, begins with entity reference open ("&"),
followed by a letter and zero or more name characters, and ends either
(a) implicitly, with the first non-name character, or (b) explicitly,
with a record end (carriage return) or reference end (";").
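
As a quick illustration of what that pattern picks up, here is a small sketch that runs it over a made-up sample string. The string and variable names are assumptions for the example; the pattern is the one given above, untested against the full SGML rules.

```php
<?php
// The pattern from the post, with backticks as delimiters and /i for case.
$pattern = '`&[a-z][a-z0-9.:_-]*[\r;]?`i';

// Hypothetical sample: a closed reference, an unclosed one, a bare
// ampersand followed by a space, and another closed reference.
$sample = 'a &lt; b &amp c & d &copy;';

preg_match_all($pattern, $sample, $matches);
$candidates = $matches[0];
```

Here `$candidates` holds the syntactic candidates only. The bare ampersand is skipped because no letter follows it, while the unclosed `&amp` is still picked up because the reference may end implicitly.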

That's not all though, for you must know *when* to parse for entity
references. Character sequences matching the syntax of entity
references may not actually *be* entity references. It's a mistake,
for example, to replace "&lt;" in a comment with "&#60;" -- what the
entity reference "&lt;" refers to: a character reference representing
"<".

As to the original point of discussion, why are spaces being
introduced into HTML? And why are entity references being
dereferenced?

--
Jock

Jul 17 '05 #9

FLEB

Regarding this well-known quote, often attributed to John Dunlop's famous
"Thu, 3 Jun 2004 18:01:38 +0100" speech:

FLEB wrote:
...AND I can't write a freakin' regexp, it seems... I forgot the + after
the character class (+, not *, since the entity &; doesn't exist):

if ($preg_match('/^&[^;]+;/', substr($instring, $i), $matchybits)) {

That pattern matches "& lt ;", but not "&lt". The former is clearly
*not* an entity reference, whereas the latter is.

From <http://xml.coverpages.org/sgmlsyn/sgmlsyn.htm>, I understand
this PCRE matches entity references in HTML4.01 (untested):

`&[a-z][a-z0-9.:_-]*[\r;]?`i

An entity reference, in HTML, begins with entity reference open ("&"),
followed by a letter and zero or more name characters, and ends either
(a) implicitly, with the first non-name character, or (b) explicitly,
with a record end (carriage return) or reference end (";").

That's not all though, for you must know *when* to parse for entity
references. Character sequences matching the syntax of entity
references may not actually *be* entity references. It's a mistake,
for example, to replace "&lt;" in a comment with "&#60;" -- what the
entity reference "&lt;" refers to: a character reference representing
"<".

As to the original point of discussion, why are spaces being
introduced into HTML? And why are entity references being
dereferenced?

The regexp is just a rough test to filter out anything that remotely looks
like an entity reference. If it passes the rough test, the code attempts to
de-entity the matched text (&[^;]+;), and if it succeeds in de-entitying (if
the result is one character long), the text was obviously a valid entity.
The entire matched portion is then read as one character, for the purpose
of counting eighty characters. A stricter regexp, /^&[a-zA-Z]+;/ would
have worked, true, but mine will work just as well.

If the de-entity fails (returns multiple characters), then the program just
counts the ampersand and goes on, just like any other character. This way,
something like "& blah, blah, &amp; blah! ;" will count the first & as a
normal character, since trying to de-entity it returns more than one
character, and move on. After the first ampersand is eaten, it will later
regex match on &amp;, that will convert to one character, "&", and the
program will thus count it as one.
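
Pulling together the corrections from earlier in the thread, the approach described here might be sketched as below. This is an illustration under stated assumptions, not tested production code: `wrapEvery()` and its parameters are hypothetical names, the counter is reset after each inserted space (the earlier code did not do this), characters are counted per byte as in the earlier code, and the one-character check uses a /u regex so that entities decoding to a multi-byte UTF-8 character still count as one.

```php
<?php
// Hypothetical helper: insert a space after every $limit counted characters,
// treating a valid entity reference as a single character.
function wrapEvery(string $in, int $limit = 80): string
{
    $out = '';
    $count = 0;
    $len = strlen($in);
    for ($i = 0; $i < $len; /* advanced below */) {
        $chunk = $in[$i];
        // Rough test: does something entity-shaped start at offset $i?
        if ($chunk === '&' && preg_match('/\G&[^;]+;/', $in, $m, 0, $i)) {
            // It is a valid entity only if it decodes to one character.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML401, 'UTF-8');
            if (preg_match('/\A.\z/us', $decoded) === 1) {
                $chunk = $m[0];
            }
        }
        $out .= $chunk;          // copy before moving the index
        $i += strlen($chunk);
        $count++;
        if ($count === $limit) {
            $out .= ' ';
            $count = 0;
        }
    }
    return $out;
}
```

With a limit of 3 for readability, `wrapEvery('ab&amp;cd', 3)` counts `&amp;` as the third character and puts the space after it, while `&bogus;` is counted one byte at a time.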

I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML? AFAIK, you should use
&amp;. I might be totally wrong, though.

Comments (knowing WHEN to parse) are something I hadn't really taken into
account. Good call. For this person's uses, I suppose they should skip over
anything within a <!-- --> block and call it zero chars, since it won't add
to the display size.

--
-- Rudy Fleminger
-- sp@mmers.and.evil.ones.will.bow-down-to.us
(put "Hey!" in the Subject line for priority processing!)
-- http://www.pixelsaredead.com

Jul 17 '05 #10

yawnmoth

On Thu, 03 Jun 2004 06:01:57 -0400, Mladen Gogala
<go****@sbcglobal.net> wrote:

On Thu, 03 Jun 2004 05:54:16 +0000, yawnmoth wrote:
<snip>
The only finite state machine generator for PHP that I know of is Libero.
(http://www.imatix.com/html/libero/index.htm). It's free, but I've never
used it. I was looking into it when I needed lexer classes for C++. Flex
output was disgusting. Unfortunately, the project that I needed it for
was killed, so I never looked at Libero again. In other words, may the
force be with you.

i hadn't heard of that - thanks! :)

Jul 17 '05 #11

John Dunlop

FLEB wrote:

[ ... ]

I'm not sure on this, but are you ever actually supposed to have ampersands
in anything except a character entity in HTML/XML?

Yes and no. ;o)

In XML, except in CDATA sections (XML1.0 sec. 2.7), ampersands cannot
appear in their literal form; in HTML, however, unless an ampersand
begins an entity reference or forms part of the beginning of a
character reference, it's not markup.

AFAIK, you should use &amp;.

That's what the HTML spec recommends too.
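
In PHP, a minimal sketch of following that recommendation is to pass text through `htmlspecialchars()` before it goes into the page. The sample string and variable names are made up for the example.

```php
<?php
// Hypothetical user-supplied text containing a literal ampersand.
$raw = 'Tom & Jerry <3';

// Escape &, <, > and both quote styles for safe inclusion in HTML.
$escaped = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');
```

Note that by default this also re-escapes ampersands that already start an entity reference, so it is best applied once, to text that has not been escaped yet.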

Have a good weekend!

--
Jock

Jul 17 '05 #12
