A <TEXTAREA> is a weird thing.

Les Paul

I'm trying to design an HTML page that can edit itself. In essence, it's
just like a Wiki page, but my own very simple version. It's a page full
of plain old HTML content, and then at the bottom, there's an "Edit"
link. So the page itself looks something like this:

<HTML><HEAD><TITLE>blah</TITLE></HEAD><BODY>


<H1>Hello World!</H1>
<P>More stuff here...</P>


<A HREF="/cgi-bin/editpage.cgi?page=thisfile.html">Edit</A>
</BODY></HTML>

So if you click the "Edit" link, the CGI script goes out and reads the
body of the "thisfile.html" file. It uses those special "TEXT STARTS
HERE" tags to identify where the editable content starts and stops. It
then just dumps whatever is between those tags into a page that looks
like this:

<HTML><HEAD><TITLE>blah</TITLE></HEAD><BODY>
<FORM ACTION="/cgi-bin/savepage.cgi" METHOD="POST">
<TEXTAREA NAME="pagetext" COLS="120" ROWS="35" WRAP="OFF">

<H1>Hello World!</H1>
<P>More stuff here...</P>

</TEXTAREA>
<INPUT TYPE=SUBMIT VALUE="Save">
</FORM>
</BODY></HTML>

So, on that page, you can edit whatever you want inside the TEXTAREA and
then click the "Save" button and the "savepage.cgi" script will write the
new data right into "thisfile.html". And so far, this works great.

But I ran into trouble as soon as I started trying to enter special
characters in the TEXTAREA. For example, I open the page and click the
"Edit" link and then start typing in fancy stuff like this:

<H1>Hello World!</H1>
<P>More stuff here...</P>
&lt;&nbsp;&nbsp; Some text &nbsp;&nbsp;&gt;

And it *seems* to work at first. When I save the page and then view it,
sure enough, I see what I'd expect, something like:

Hello World!
More stuff here...
< Some text >

But the next time I try to edit it, I can see that something has gone
horribly wrong underneath the surface. The "&nbsp;" characters are gone,
and have been replaced by some weird unicode whitespace characters (or
something?). The "&lt;" and "&gt;" get replaced by actual "<" and ">"
symbols, which is no good at all. The web browser now thinks that
"<Some text>" is some sort of tag and doesn't display it at all. Yuck.

So I searched around a little, and I see that it's actually standard
behavior for the <TEXTAREA> field to automatically perform conversions
like this. Ok, fine, so how do I turn this "feature" off so that
*exactly* what I type gets saved? At first, I thought that I could fix
it in my CGI script by just always expanding "<" and ">" and other
special characters back into their HTML equivalents.

But that obviously won't work, because then it will mangle every tag in
the whole file. In fact, the text in the example above would become:

&lt;H1&gt;Hello World!&lt;/H1&gt;
&lt;P&gt;More stuff here...&lt;/P&gt;
...

Ugh. So I can't really solve this in the script. I need to turn this
dumb behavior of the TEXTAREA off, or this just can't work. I need for
"<" to stay "<" and I need for "&lt;" to stay "&lt;" and that's it. Any
ideas?

Thanks for reading, and thanks for any help.

Pat

Jul 24 '05 #1

Benjamin Niemann

You must escape at least < and & (and any other character you want to see as
an entity instead of 'weird unicode whitespace characters', e.g. the
character with code 160, the no-break space, into '&nbsp;') before you
insert the HTML to edit into the <textarea> in your script. Look for an
appropriate function in your programming language's library.

The generated HTML code should look like this:

<TEXTAREA>
&lt;H1&gt;Hello World!&lt;/H1&gt;
&lt;P&gt;More stuff here...&lt;/P&gt;
&amp;lt;&amp;nbsp;&amp;nbsp; Some text &amp;nbsp;&amp;nbsp;&amp;gt;
</TEXTAREA>

The browser does not need to know that the editable text is HTML code.
The fact that browsers parse e.g. <H1> as a literal string "<H1>" instead of
a <H1> tag (that is not allowed in <TEXTAREA>) is just another case of
'over-tolerant' behaviour that obviously causes only confusion.
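
A minimal sketch of that escaping step, written in Python rather than the Perl the original poster is using. The function name render_edit_form and the sample string are made up for illustration; html.escape is the standard-library call assumed here to play the role of the "appropriate function".

```python
import html


def render_edit_form(stored_html: str) -> str:
    """Build an edit form whose textarea shows stored_html literally.

    Hypothetical helper: the form attributes are copied from the thread,
    the function itself is only an illustration.
    """
    # quote=False: inside element content only &, < and > need escaping.
    escaped = html.escape(stored_html, quote=False)
    return (
        '<FORM ACTION="/cgi-bin/savepage.cgi" METHOD="POST">\n'
        '<TEXTAREA NAME="pagetext" COLS="120" ROWS="35">\n'
        + escaped
        + '\n</TEXTAREA>\n'
        '<INPUT TYPE="SUBMIT" VALUE="Save">\n'
        '</FORM>'
    )


# The stored page fragment: one real tag pair plus literal entity references.
stored = "<H1>Hello World!</H1>\n&lt;&nbsp;&nbsp; Some text &nbsp;&nbsp;&gt;"
form = render_edit_form(stored)
print(form)
```

Note that html.escape only touches &, < and >. A raw no-break space character (code 160) in the stored file would pass through unchanged, so turning it into an entity would need an extra replace step.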

--
Benjamin Niemann
Email: pink at odahoda dot de
WWW: http://www.odahoda.de/

Jul 24 '05 #2

Les Paul

Benjamin Niemann <pi**@odahoda.de> wrote in

[cut]

You must escape at least < and & (and any other character you want to
see as an entity instead of 'weird unicode whitespace characters',
e.g. the character with ASCII code 160 into '&nbsp;') before you
insert the HTML to edit into the <textarea> in your script. Look for
an appropriate function in your programming language's library.

The generated HTML code should look like this:

<TEXTAREA>
&lt;H1&gt;Hello World!&lt;/H1&gt;
&lt;P&gt;More stuff here...&lt;/P&gt;
&amp;lt;&amp;nbsp;&amp;nbsp; Some text &amp;nbsp;&amp;nbsp;&amp;gt;
</TEXTAREA>

The browser does not need to know that the editable text is HTML code.
The fact that browsers parse e.g. <H1> as a literal string "<H1>"
instead of a <H1> tag (that is not allowed in <TEXTAREA>) is just
another case of 'over-tolerant' behaviour that obviously causes only
confusion.

This is exactly what I *don't* want to happen. I don't want the <H1> to
be converted to anything, I want it to stay as <H1>. Likewise, I don't
want the "&lt;" to be converted to "<". If the user types in <H1>blah
</H1> in the box, then he expects to get a page with a big, bold "blah"
on it. If he types in "&lt;" then he expects to get a page with a "<" on
it. I don't want TEXTAREA to convert either of those for me. I can do
any type of conversion necessary in my script file, but it won't work.

That's what I was talking about when I said:

At first, I thought that I could fix
it in my CGI script by just always expanding "<" and ">" and other
special characters back into their HTML equivalents.

But that obviously won't work, because then it will mangle every tag
in the whole file.

If you look at my example, here's what the user types into the box,
exactly letter for letter (nothing has been converted yet):

<H1>Hello World</H1>
More stuff here...
&lt; &nbsp; Some text &nbsp; &gt;

And then he clicks the "Save" button on the form. But before it gets to
my CGI script, it has already been mangled. So inside my perl code, this
is what the string looks like to me:

<H1>Hello World</H1>
More stuff here...
< Some text >

See the problem? How do I know which "<" and ">" characters should be
converted *back* to the HTML codes (e.g. &lt;)? It's already too late at
this point. I have no way to know if "< Some text >" is an HTML tag
like "<H1>", or if the user really wants to have a less-than sign,
followed by three spaces, followed by "Some text" etc...

Pat

Jul 24 '05 #3

Lachlan Hunt

Les Paul wrote:

Benjamin Niemann <pi**@odahoda.de> wrote in

[cut]

You must escape at least < and & (and any other character you want to
see as an entity instead of 'weird unicode whitespace characters',
...
The generated HTML code should look like this:

<TEXTAREA>
&lt;H1&gt;Hello World!&lt;/H1&gt;
&lt;P&gt;More stuff here...&lt;/P&gt;
&amp;lt;&amp;nbsp;&amp;nbsp; Some text &amp;nbsp;&amp;nbsp;&amp;gt;
</TEXTAREA>
This is exactly what I *don't* want to happen.

Yet, it is exactly what *must* happen if this is to work correctly for
you. You simply misunderstand the reasoning behind it.
I don't want the <H1> to be converted to anything, I want it to stay as
<H1>.
Yes, you want an <h1> to be an <h1> in the final document, but it
must be converted for the editing process to happen correctly.

For example, if this very simplified example is the final document:

<h1>Heading</h1>

When a user selects to edit the page, you want the markup displayed
within the textarea for the user to edit and submit back, which should
look like this diagram:
______________________
|<h1>Heading</h1> |
| |
|_____________________|

The markup for that to be done correctly, needs to be this:

<textarea rows="..." cols="...">
&lt;h1&gt;Heading&lt;/h1&gt;
</textarea>

The markup your current system produces...

<textarea rows="..." cols="...">
<h1>Heading</h1>
</textarea>

....is invalid because the content of a textarea is defined as #PCDATA,
not #CDATA, so elements and entity references are supposed to be parsed
as elements and entity references, not plain text, despite the behaviour
of existing tag-soup browsers. Try running your current site through
the validator, and you'll see what I mean.
http://validator.w3.org/
Likewise, I don't want the "&lt;" to be converted to "<".

Then, within the textarea, your system must convert all ampersands "&"
to &amp;. That means that anywhere an entity reference such as &amp;,
&lt; or &gt; occurs, the output must be &amp;amp;, &amp;lt; and
&amp;gt; respectively. So any occurrence of just &amp;, &lt; or
&gt; will always be converted by the browser, but this will give you the
result you want, and is valid markup.

So, to extend the previous example, the text area markup could look like
this (I've added spaces for easier reading):

<textarea rows="..." cols="...">
&lt;h1&gt;Heading&lt;/h1&gt;

&lt;p&gt; &amp;lt; Some&amp;nbsp;Content &amp;gt; &lt;/p&gt;
</textarea>

When submitted, your system should convert all of &amp;, &lt; and &gt; back
to &, < and >, respectively. So, when the above is submitted, the final
output markup should be:

<h1>Heading</h1>

<p> &lt; Some&nbsp;Content &gt; </p>

I hope this explains it clearly enough for you.

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

Jul 24 '05 #4

Lachlan Hunt

Lachlan Hunt wrote:

When submitted, your system should convert all of &amp;, &lt; and &gt; back
to &, < and >, respectively. So, when the above is submitted, the final
output markup should be:

Oops, my mistake. The system receiving the submission shouldn't have to
perform any conversions, because browsers will submit the content in the
correct form. E.g. markup within a text area like this:

<textarea ...>
&lt;h1&gt;Heading&lt;/h1&gt;
</textarea>

Which will render like this within the textarea:

<h1>Heading</h1>

Will be submitted in exactly the way you want it. The only conversions
will need to be done in order to generate the proper markup for within
the textarea when a user requests to edit the page, not after a user
submits content.
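
A rough way to convince yourself of this, sketched in Python rather than Perl. Here html.unescape merely stands in for what the browser does when it parses the textarea content; the variable names and the sample string are illustrative assumptions, and a real browser also deals with character encoding and newlines, which this ignores.

```python
import html

# What is stored in the page file: a real tag pair plus literal references.
original = "<h1>Heading</h1>\n<p> &lt; Some&nbsp;Content &gt; </p>"

# Case 1: escape once when building the edit form.
in_textarea = html.escape(original, quote=False)
# The browser decodes the references once; that is what gets submitted.
submitted = html.unescape(in_textarea)

# Case 2: dump the file into the textarea without escaping.
mangled = html.unescape(original)

print(submitted == original)  # the escaped round trip is lossless
print(repr(mangled))          # the references have been decoded away
```

In the first case the script gets back exactly what was stored, so nothing has to be converted on the receiving side. In the second case the entity references have already been decoded by the time the script sees the text.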

--
Lachlan Hunt
http://lachy.id.au/
http://GetFirefox.com/ Rediscover the Web
http://GetThunderbird.com/ Reclaim your Inbox

Jul 24 '05 #5

Alan J. Flavell

On Fri, 8 Apr 2005, Lachlan Hunt wrote:

Oops, my mistake. The system receiving the submission shouldn't
have to perform any conversions, because browsers will submit the
content in the correct form.
If only it was so simple for all of the situations which arise in
practice!
<h1>Heading</h1>

Will be submitted in exactly the way you want it.
Except that your example was far too simple. Try a euro sign in a
page coded as iso-8859-1, or a Russian letter, or some Arabic...
The only conversions will need to be done in order to generate the
proper markup for within the textarea when a user requests to edit
the page, not after a user submits content.

The confusion with ampersands was irrelevant, really. Naturally,
markup-significant characters, if they are not to perform their HTML
function, have to be escaped *IN HTML SOURCE* by using ampersand
notation: this is no different in a textarea than it is in other HTML
context.

But once the textarea has been rendered on the browser, and is being
prepared for submission, HTML itself plays no further part in the
proceedings. The rules for encoding text areas for submission are
clearly defined in the scope where they are defined, and are clearly
documented as undefined outside of that scope, whether we like it or
not. (I s'pose it's inevitable now that I'm going to mention my web
page on the topic,
http://ppewww.ph.gla.ac.uk/~flavell/...form-i18n.html ; but Jukka
also has good material on forms submission, for example.)

As I say there,

Nevertheless, as an author, this isn't under your control: readers
can and will submit extended characters - there's nothing you can do
to stop them - so your server-side scripts need to be able to do
something with them - if only to recognise them and politely refuse
them (but preferably something more constructive, if you feel up to
it).

Dealing with newlines in text areas is also a hassle. Or can be.
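
On the newline point, a small sketch of one way a receiving script might cope, in Python for illustration. Browsers generally submit textarea line breaks as CR LF, so a script that writes the text straight into a file on a Unix host may want to normalise them first. The function name normalize_newlines is hypothetical.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR LF and lone CR line breaks to LF (hypothetical helper)."""
    # Replace the two-character sequence first so it is not counted twice.
    return text.replace("\r\n", "\n").replace("\r", "\n")


submitted = "<H1>Hello World!</H1>\r\n<P>More stuff here...</P>\r\n"
print(repr(normalize_newlines(submitted)))
```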

Jul 24 '05 #6

Andreas Prilop

On Fri, 8 Apr 2005, Alan J. Flavell wrote:

Except that your example was far too simple. Try a euro sign in a
page coded as iso-8859-1,

General agreement among browsers and search engines seems
to be that the euro sign is x80 in ISO-8859-1. Or rather:
ISO-8859-1 is an alias for Windows-1252. grmpf
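
To make the euro sign case concrete, a short Python sketch. The helper can_encode is made up for this example; the codec names "iso-8859-1" and "cp1252" are Python's own names for the two encodings.

```python
def can_encode(text: str, charset: str) -> bool:
    """Report whether text is representable in charset (hypothetical helper)."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


euro = "\u20ac"  # the euro sign

print(can_encode(euro, "iso-8859-1"))      # real ISO-8859-1 has no euro sign
print(euro.encode("cp1252"))               # Windows-1252 puts it at byte 0x80
print(b"\x80".decode("cp1252") == euro)    # decoding as Windows-1252 recovers it
print(repr(b"\x80".decode("iso-8859-1")))  # strict ISO-8859-1: a C1 control
```

So a script that receives byte 0x80 from a page labelled iso-8859-1 has to decide which of the two readings it believes.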

Jul 24 '05 #7

Lars Eighner

In our last episode, <Xn******************@140.99.99.130>, the
lovely and talented Les Paul broadcast on
comp.infosystems.www.authoring.html:

This is exactly what I *don't* want to happen. I don't want the <H1> to
be converted to anything, I want it to stay as <H1>. Likewise, I don't
want the "&lt;" to be converted to "<". If the user types in <H1>blah
</H1> in the box, then he expects to get a page with a big, bold "blah"
on it. If he types in "&lt;" then he expects to get a page with a "<" on
it. I don't want TEXTAREA to convert either of those for me. I can do
any type of conversion necessary in my script file, but it won't work.
TEXTAREA doesn't *do* anything, and certainly doesn't convert
strings. Your complaint is about browser behavior.
See the problem? How do I know which "<" and ">" characters should be
converted *back* to the HTML codes (e.g. &lt;)? It's already too late at
this point. I have no way to know if "< Some text >" is an HTML tag
like "<H1>", or if the user really wants to have a less-than sign,
followed by three spaces, followed by "Some text" etc...

Less-than would be entered &lt;

--
Lars Eighner ei*****@io.com http://www.larseighner.com/
War on Terrorism: History a Mystery
"He's busy making history, but doesn't look back at his own, or the
world's.... Bush would rather look forward than backward." --_Newsweek_

Jul 24 '05 #8

Jan Roland Eriksson

On Fri, 08 Apr 2005 09:50:01 GMT, Les Paul <no**@none.none> wrote:
[...]

...here's what the user types into the box, exactly letter for
letter...

<H1>Hello World</H1>
More stuff here...
&lt; &nbsp; Some text &nbsp; &gt;

...before it gets to my CGI script, it has already been mangled.
So inside my perl code, this is what the string looks like...

<H1>Hello World</H1>
More stuff here...
< Some text >

See the problem?

No, I don't.

How do I know which "<" and ">" characters should be converted...

You should apply the SGML standard parsing rules that have been in effect
for close to two decades now.

...I have no way to know if "< Some text >" is an HTML tag

Yes you have, since according to said parsing rules just that part can
_not_ be an HTML tag.

Any SGML start tag must be built as follows, if it's based on the SGML
concrete reference syntax (as is the case for HTML).

=====

1:STAGO (StartTAGOpen, defined to be the character '<')

2:_directly_ followed by a defined NAMESTART character

3:optionally followed by an arbitrary number of defined NAMECHARacters,
that together with the initial NAMESTART character forms the element
name

4:optionally followed by one or more white space characters

5:optionally followed by yet another NAMESTART character

6:optionally followed by an arbitrary number of NAMECHARacters that
together with the initial NAMESTART character forms an attribute name

7:optionally followed by one or more white space characters

8:followed by one "=" character

9:optionally followed by one or more white space characters

10:followed by a quoted attribute value character string

11:optionally repeat from 4: for more attribute definitions

12:optionally followed by one or more white space characters

13:TAGC (TAGClose, defined to be the character '>')

=====

The first "beauty" of the SGML parsing method is that it in just about
any foreseeable case does not require a "look ahead" of more than one
character at a time to give accurate results.

The second "beauty" of that method is that there are tons of free Perl
code libraries available to do the job for you.

A Google search for - "SGML parsing" AND "Perl library" - gives some 100
hits, surely you can find what you need in there.
...or if the user really wants to have a less-than sign,
followed by three spaces, followed by "Some text" etc...
The thing that disqualifies your "< Some text >" example from being
a valid start tag is that there is white space between the possible
STAGO and the characters that might be valid as a NAMESTART followed by
NAMECHARacters. That one is really easy to parse.

Of course you will need to identify end tags too.
The method is close to the same as already described...

=====

1:ETAGO (EndTAGOpen, defined to be the characters '</')

2a:_optionally_ followed by the exact element name that is to be closed

3a:optionally followed by one or more white space characters

4a:followed by TAGC (TAGClose, defined to be the character '>')

....OR...

2b:_directly_ followed by TAGC

=====

This last part should tell you that all of these following lines are
valid SGML, and HTML, markup...

<p>some text in a paragraph</p>

<p>some text in another paragraph</p

<p>some text in yet another paragraph</>

Naturally there is a lot more to the game here but I hope that my "crash
course" above will give you an incentive to get onto the "correct track"
that can solve your initial problem.
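
The "directly followed by" rule from the two lists above can be sketched with a couple of regular expressions. This is a Python illustration only and nowhere near a full SGML parser: the name opens_tag is made up, and the character classes assume that name start characters are letters and name characters are letters, digits, "." and "-".

```python
import re

# STAGO: "<" directly followed by a name start character, then name characters.
START_TAG_OPEN = re.compile(r"<[A-Za-z][A-Za-z0-9.-]*")
# ETAGO: "</" directly followed by a name start character, or by ">" (empty end tag).
END_TAG_OPEN = re.compile(r"</(?=[A-Za-z>])")


def opens_tag(text: str, pos: int = 0) -> bool:
    """True if the "<" at text[pos] opens a start or end tag (simplified)."""
    return bool(START_TAG_OPEN.match(text, pos) or END_TAG_OPEN.match(text, pos))


print(opens_tag("<H1>"))           # "<" directly followed by a letter
print(opens_tag("< Some text >"))  # white space after "<", so not a tag
print(opens_tag("</>"))            # empty end tag
```

The check only needs to look at the one character after "<" (or after "</"), which is the one-character look-ahead mentioned above.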

--
Rex

Jul 24 '05 #9

Les Paul

Jan Roland Eriksson <jr****@newsguy.com> wrote in
news:19********************************@4ax.com:

See the problem?

No, I don't.
How do I know which "<" and ">" characters should be converted...

You should apply the SGML standard parsing rules that have been in
effect for close to two decades now.

That's a horrible, horrible idea. Not to mention that it won't work.
You're suggesting putting an HTML parser in the CGI script just to
determine whether a "<" symbol is part of a tag or not.

It won't work because the user could type the following in the box:

&lt;H1&gt;

And after that was converted to "<H1>" the script would think that was
intended to be a tag, when it wasn't.

Lachlan's suggestion of escaping the special characters before dumping them
between the <TEXTAREA> tags is working just fine.

You know, common sense has been around for a lot more than "close to two
decades now," but some people still seem to lack it.

Pat

Jul 24 '05 #10

Les Paul

Lachlan Hunt <sp***********@gmail.com> wrote in
news:42***********************@per-qv1-newsreader-01.iinet.net.au:

Lachlan Hunt wrote:
When submitted, your system should convert all of &amp;, &lt; and &gt;
back to &, < and >, respectively. So, when the above is submitted,
the final output markup should be:

Oops, my mistake. The system receiving the submission shouldn't have
to perform any conversions, because browsers will submit the content
in the correct form. E.g. markup within a text area like this:

Thanks, this is working. Yeah, it looks like I don't have to do anything
on the "receive" side. I just convert everything that goes between the
<textarea> tags and the "edit box" renders them correctly. Well, at least
in IE it does.

Thanks again,
Pat

Jul 24 '05 #11

Pierre Goiffon

Les Paul wrote:

But I ran into trouble as soon as I started trying to enter special
characters in the TEXTAREA. For example, I open the page and click the
"Edit" link and then start typing in fancy stuff like this:

<H1>Hello World!</H1>
<P>More stuff here...</P>
&lt;&nbsp;&nbsp; Some text &nbsp;&nbsp;&gt;

And it *seems* to work at first. When I save the page and then view it,
sure enough, I see what I'd expect, something like:

Hello World!
More stuff here...
< Some text >

But the next time I try to edit it, I can see that something has gone
horribly wrong underneath the surface. The "&nbsp;" characters are gone,
and have been replaced by some weird unicode whitespace characters (or
something?). The "&lt;" and "&gt;" get replaced by actual "<" and ">"
symbols, which is no good at all. The web browser now thinks that
"<Some text>" is some sort of tag and doesn't display it at all. Yuck.

Maybe you should try an editor component, such as one of those listed
here:
http://www.bris.ac.uk/is/projects/cms/ttw/ttw.html
I've also heard good things about:
http://kupu.oscom.org/
http://composite.mozdev.org/
And anyway you've got the good old contentEditable attribute:
http://msdn.microsoft.com/workshop/a...nteditable.asp

If you don't have enough client-side scripting capabilities, you should
use a dedicated markup language, as a lot of CMSs do. For example,
see Wikipedia:
http://en.wikipedia.org/wiki/Wikiped...ge#Wiki_markup

Jul 24 '05 #12
