A critique of cgi.escape

Lawrence D'Oliveiro

The "escape" function in the "cgi" module escapes characters with special
meanings in HTML. The ones that need escaping are '<', '&' and '"'.
However, cgi.escape only escapes the quote character if you pass a second
argument of True (the default is False):

>>> cgi.escape("the \"quick\" & <brown> fox")
'the "quick" &amp; &lt;brown&gt; fox'
>>> cgi.escape("the \"quick\" & <brown> fox", True)
'the &quot;quick&quot; &amp; &lt;brown&gt; fox'

This seems to me to be dumb. The default option should be the safe one: that
is, escape _all_ the potentially troublesome characters. The only time you
can get away with NOT escaping the quote character is outside of markup,
e.g.

<TEXTAREA>
unescaped "quotes" allowed here
</TEXTAREA>

Nevertheless, even in that situation, escaped quotes are acceptable.

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

Can changing the default break existing scripts? I don't see how. It might
even fix a few lurking bugs out there.
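
For anyone who wants to see the failure mode in question, here is a minimal sketch. The escape function below is a stand-in written to mirror the documented behaviour of cgi.escape (an assumption, not the library source), and the value string is made-up input for illustration.

```python
def escape(s, quote=False):
    # Stand-in mirroring the documented behaviour of cgi.escape
    # (assumption: not the library source). "&" must be replaced first.
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

value = 'x" onmouseover="alert(1)'  # hypothetical user input

# Default call: the raw quote closes the attribute early.
unsafe = '<input value="%s">' % escape(value)
# With the second argument: the quote is escaped and stays inside the value.
safe = '<input value="%s">' % escape(value, True)

print(unsafe)
print(safe)
```

The first line printed contains a second, attacker-controlled attribute; the second does not.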

Sep 23 '06 #1


Fredrik Lundh

Lawrence D'Oliveiro wrote:

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.

Can changing the default break existing scripts? I don't see how. It might
even fix a few lurking bugs out there.

I'm not sure this "every time I don't immediately understand something,
I'll write a change proposal instead of reading the library reference"
approach is healthy, really.

</F>

Sep 23 '06 #2

Lawrence D'Oliveiro

In message <ma**************************************@python.org>, Fredrik
Lundh wrote:

Lawrence D'Oliveiro wrote:

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes.

What works for attributes also works for ordinary text.

Sep 23 '06 #3

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

Lawrence D'Oliveiro wrote:
So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.

He's not confused, he's correct; the author of cgi.escape is the
confused one. The optional extra parameter is completely unnecessary
and achieves nothing except to make it easier for people to end up
with bugs in their code.

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character. This is essentially
identical to the quote character in HTML, so any code which is escaping
one should always be escaping the other.
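
A short sketch of the apostrophe problem, for illustration only. The escape function is again a stand-in mirroring the documented cgi.escape behaviour (an assumption, not the library source); the last lines use html.escape, the Python 3 standard-library function, which does escape the apostrophe when quote is true.

```python
import html

def escape(s, quote=False):
    # Stand-in mirroring the documented behaviour of cgi.escape
    # (assumption: not the library source).
    s = s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s

name = "O'Reilly"

# Fine when the attribute is delimited with double quotes...
double_quoted = '<a title="%s">' % escape(name, True)
# ...but the raw apostrophe ends a single-quoted attribute early.
single_quoted = "<a title='%s'>" % escape(name, True)

print(double_quoted)
print(single_quoted)

# html.escape (Python 3) escapes both kinds of quote by default.
print("<a title='%s'>" % html.escape(name))
```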

Sep 24 '06 #4

Lawrence D'Oliveiro

In message <sl***********************@snowy.squish.net>, Jon Ribbens wrote:

In article <ma**************************************@python.org>, Fredrik
Lundh wrote:
Lawrence D'Oliveiro wrote:

So I think the default for the second argument to cgi.escape should be
changed to True. Or alternatively, the second argument should be removed
altogether, and quotes should always be escaped.

you're confused: cgi.escape(s) is designed to be used for ordinary text,
cgi.escape(s, True) is designed for attributes. if you use the code the
way it's intended to be used, it works perfectly fine.

He's not confused, he's correct; the author of cgi.escape is the
confused one.

Thanks for backing me up. :)

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character. This is essentially
identical to the quote character in HTML, so any code which is escaping
one should always be escaping the other.

I must confess I did a double-take on this. But I rechecked the HTML spec
(HTML 4.0, section 3.2.2, "Attributes"), and you're right--single quotes
ARE allowed as an alternative to double quotes. It's just I've never used
them as quotes. :)

Sep 24 '06 #5

Fredrik Lundh

Lawrence D'Oliveiro wrote:

What works for attributes also works for ordinary text.

attributes and ordinary text are two different things in HTML and XML.
you're arguing that it's a good idea for *everyone* to bloat down
ordinary text just because you're too lazy to use a piece of code in the
intended way.

</F>

Sep 24 '06 #6

Fredrik Lundh

Jon Ribbens wrote:

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character.

it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes. again, punishing people who
actually read the docs and understand them is not a very good way to
maintain software.

btw, you're both missing that cgi.escape isn't good enough for general
use anyway, since it doesn't deal with encodings at all. if you want a
general purpose function that can be used for everything that can be put
in an HTML file, you need more than just a modified cgi.escape. feel
free to propose a general-purpose replacement (which should have a new
name), but make sure you think through *all* the issues before you do that.

</F>

Sep 24 '06 #7

Lawrence D'Oliveiro

In message <ma**************************************@python.org>, Fredrik
Lundh wrote:

Jon Ribbens wrote:

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.

I don't understand this "bloat down" nonsense. Any tests that would break
are obviously testing the wrong thing.

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character.

it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes.

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

btw, you're both missing that cgi.escape isn't good enough for general
use anyway, since it doesn't deal with encodings at all.

Why does it need to?

Sep 24 '06 #8

Fredrik Lundh

Lawrence D'Oliveiro wrote:

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

do you ever think before you post?

</F>

Sep 24 '06 #9

Georg Brandl

Lawrence D'Oliveiro wrote:

In message <ma**************************************@python.org>, Fredrik
Lundh wrote:

Jon Ribbens wrote:

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way, and will also
break unit tests.

I don't understand this "bloat down" nonsense. Any tests that would break
are obviously testing the wrong thing.

&quot; is 5 characters longer than ".

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character.

it's intentional, of course: you're supposed to use " if you're using
cgi.escape(s, True) to escape attributes.

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

A function is broken if its implementation doesn't match the documentation.

As a courtesy, I've pasted it below.

escape(s[, quote])
    Convert the characters "&", "<" and ">" in string s to HTML-safe sequences.
    Use this if you need to display text that might contain such characters in HTML.
    If the optional flag quote is true, the quotation mark character ('"') is also
    translated; this helps for inclusion in an HTML attribute value, as in <A
    HREF="...">. If the value to be quoted might include single- or double-quote
    characters, or both, consider using the quoteattr() function in the
    xml.sax.saxutils module instead.

Now, do you still think cgi.escape is broken?

Georg
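
For reference, a small sketch of the quoteattr() function that the documentation points to. It picks the attribute delimiter itself and returns the value already wrapped in quotes; the sample values are made up for illustration.

```python
from xml.sax.saxutils import quoteattr

samples = ['plain', 'say "hi"', "it's", 'both " and \'']

for value in samples:
    # quoteattr returns the value including its surrounding quotes,
    # so no quote characters are written in the template.
    print("<a title=%s>" % quoteattr(value))
```

Double quotes are used when possible, single quotes when the value contains only double quotes, and &quot; only when the value contains both kinds.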

Sep 24 '06 #10

Fredrik Lundh

Georg Brandl wrote:

A function is broken if its implementation doesn't match the documentation.

or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...

</F>

Sep 24 '06 #11

Jon Ribbens

In article <ef**********@news.albasani.net>, Georg Brandl wrote:

Attributes can be quoted with either single or double quotes. That's what
the HTML spec says. cgi.escape doesn't correctly allow for that. Ergo,
cgi.escape is broken. QED.

A function is broken if its implementation doesn't match the documentation.

Or if the design, as described in the documentation, is flawed in some
way.

As a courtesy, I've pasted it below.

[...]

Now, do you still think cgi.escape is broken?

Yes.

Sep 25 '06 #12

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way,

By a minuscule degree. That is a very weak argument by any standard.

and will also break unit tests.

Er, so change the unit tests at the same time?

One thing that is flat-out wrong, by the way, is that cgi.escape()
does not encode the apostrophe (') character.

it's intentional, of course:

I noticed. That doesn't mean it isn't wrong.

you're supposed to use " if you're using cgi.escape(s, True) to
escape attributes. again, punishing people who actually read the
docs and understand them is not a very good way to maintain
software.

In what way is anyone being "punished"? Deliberately retaining flaws
and misfeatures that can easily be fixed without damaging
backwards-compatibility is not a very good way to maintain software
either.

btw, you're both missing that cgi.escape isn't good enough for general
use anyway,

I'm sorry, I didn't realise this was a general thread about any and
all inadequacies of Python's cgi module.

since it doesn't deal with encodings at all.

Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.

By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.

Sep 25 '06 #13

Lawrence D'Oliveiro

In message <ma**************************************@python.org>, Fredrik
Lundh wrote:

Georg Brandl wrote:

A function is broken if its implementation doesn't match the
documentation.

or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...

_We_ certainly have noticed it.

Sep 25 '06 #14

Fredrik Lundh

Lawrence D'Oliveiro wrote:

Georg Brandl wrote:

A function is broken if its implementation doesn't match the
documentation.

or if it doesn't match the designer's intent. cgi.escape is old enough
that we would have noticed that, by now...

_We_ certainly have noticed it.

you're not the designer, you're just some random guy who thinks that if you
don't understand something at first, it has to be changed, even if that change
would break things for others. maybe you haven't done software long enough
to understand that software works better if you use it the way it was intended
to be used, but that's no excuse for being stupid.

</F>

Sep 25 '06 #15

Fredrik Lundh

Jon Ribbens wrote:

Or if the design, as described in the documentation, is flawed in some
way.

it does exactly what it says, and is perfectly usable as is, if you bother to
use it the way it was intended to be used.

(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).

</F>

Sep 25 '06 #16

Fredrik Lundh

Jon Ribbens wrote:

since it doesn't deal with encodings at all.

Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.

If you're really serious about making things easier to use, shouldn't
you look at the whole picture? HTML documents are byte streams, so
any transformation from internal character data to HTML must take both
escaping and encoding into account. If you and Lawrence have a hard
time remembering how to use the existing cgi.escape function, despite
its utter simplicity, surely it would make your life even easier if
there was an alternative API that would handle both the easy part
(escaping) and the hard part (encoding) ?

By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.

</F>

Sep 25 '06 #17

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

maybe you haven't done software long enough to understand that
software works better if you use it the way it was intended to be
used, but that's no excuse for being stupid.

So what's your excuse?

Sep 25 '06 #18

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

If you're really serious about making things easier to use, shouldn't
you look at the whole picture? HTML documents are byte streams, so
any transformation from internal character data to HTML must take both
escaping and encoding into account.

Ever heard of modular programming? I would suggest that you do indeed
take a step back and look at the whole picture - it's the whole
picture that needs to take escaping and encoding into account. There's
nothing to say that cgi.escape should take them both into account in
the one function, and in fact as you yourself have already commented,
good reasons for it not to, in that it would make it excessively
complicated.

If you and Lawrence have a hard time remembering how to use the
existing cgi.escape function, despite its utter simplicity, surely
it would make your life even easier if there was an alternative API
that would handle both the easy part (escaping) and the hard part
(encoding) ?

You seem to be arguing that because, in an ideal world, it would be
better to throw away the 'cgi' module completely and start again, it
is not worth making minor improvements in what we already have.
I would suggest that this is, to put it mildly, not a good argument.

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself.

You are merely compounding your bad manners. All of your above
allegations are outright lies. I am not sure if you are simply not
understanding the simple points I am making, or are deliberately
trying to mislead people for some bizarre reason of your own.

Breaking things just because you think you can simply isn't the
Python way of doing things.

Your hyperbole is growing more extravagant. To begin with, you were
claiming that the suggested change would make things (minusculely)
less efficient, now you're claiming it will "break" unspecified
things. What precisely do you think it would "break"?

Sep 25 '06 #19

Duncan Booth

Jon Ribbens <jo********@unequivocal.co.uk> wrote:

and will also break unit tests.

Er, so change the unit tests at the same time?

It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would
probably break many tests for a large number of people.

If the change were seen as a good thing, then a backwards compatible change
(e.g. introducing a function with a different name) might be considered,
but if so it should address the whole issue: the current lack of support
for encodings is IMHO a far bigger problem than whether or not a quote mark is
escaped.

Why does it need to? cgi.escape is (or should be) dealing with
character strings, not byte sequences. I must admit,
internationalisation is not my forte, so if there's something
I'm missing here I'd love to hear about it.

If I have a unicode string such as: u'\u201d' (right double quote), then I
want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
is better). For many purposes I could just encode it in the encoding to be
used for the page, typically latin1 or utf8, but sometimes that isn't
possible e.g. if you don't know the encoding at the point when you produce
the string, or if there is no translation for the character in the desired
encoding. The character reference will work whatever encoding is used for
the page.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;
at present I need to call both cgi.escape and s.encode to get the desired
effect.
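
The two-step approach mentioned above can be sketched as follows. This assumes Python 3, where html.escape(s, quote=False) performs the same three replacements that cgi.escape(s) is documented to perform; the sample text is made up. Note the order: escaping must happen before the character references are generated, or their ampersands get escaped too.

```python
import html

text = 'He said \u201dhi\u201d & left'  # contains right double quotes

# Step 1: escape the markup characters (&, <, >).
escaped = html.escape(text, quote=False)

# Step 2: encode, turning anything outside ASCII into a numeric
# character reference via the standard 'xmlcharrefreplace' handler.
encoded = escaped.encode('ascii', 'xmlcharrefreplace')
print(encoded)

# Doing the steps in the wrong order mangles the references.
wrong = html.escape(
    text.encode('ascii', 'xmlcharrefreplace').decode('ascii'), quote=False)
print(wrong)
```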

Sep 25 '06 #20

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

(still waiting for the "jon's enhanced escape" proposal, btw, but I guess it's
easier to piss on others than to actually contribute something useful).

Well, yes, you certainly seem to be good at the "pissing on others"
part, even if you have to lie to do it. You have had the "enhanced
escape" proposal all along - it was the post which started this
thread! If you are referring to your strawman argument about
encodings, you have yet to show that it's relevant.

If it'll make you any happier, here's the code for the 'cgi.escape'
equivalent that I usually use:

import re

_html_encre = re.compile("[&<>\"'+]")
_html_encodes = { "&": "&amp;", "<": "&lt;", ">": "&gt;", "\"": "&quot;",
                  "'": "&#39;", "+": "&#43;" }

def html_encode(raw):
    # Replace each special character with its entity in a single pass.
    return re.sub(_html_encre, lambda m: _html_encodes[m.group(0)], raw)

Sep 25 '06 #21

Max M

Fredrik Lundh wrote:

Jon Ribbens wrote:

By the way, if you could try and put across your proposed arguments as
to why you don't favour this suggested change without the insults and
general rudeness, it would be appreciated.

I've already explained that, but since you're convinced that your use
case is more important than other use cases, and you don't care about
things like stability and respect for existing users of an API, nor
the cost for others to update their code and unit tests, I don't see
much need to repeat myself. Breaking things just because you think
you can simply isn't the Python way of doing things.

This thread is highly entertaining but perhaps not that productive.
Lawrence is right that the escape method doesn't work the way he expects
it to.

Rewriting a library module simply because a developer is surprised is a
*very* bad idea. It would break just about every web app out there that
uses the escape module and uses testing. Which is probably most of them.
That could mean several man years of wasted time. It also makes the
escaped html harder to read for standard cases.

Fredrik is right that doing so is utterly ... well let us call it
"unproductive". Stupid is such a harsh word ;-)

Whether someone finds the bloat minuscule and thus a small enough change
to warrant the rewrite does not really matter.

Lawrence is free to write a wrapper and use that instead.

my_escape = lambda st: cgi.escape(st, 1)

So. Lawrence is happy, and the escape works as expected. Several man
years have been saved.

Max M

Sep 25 '06 #22

Fredrik Lundh

Jon Ribbens wrote:

There's nothing to say that cgi.escape should take them both into account
in the one function

so what exactly are you using cgi.escape for in your code ?

What precisely do you think it would "break"?

existing code, and existing tests.

</F>

Sep 25 '06 #23

Jon Ribbens

In article <Xn*************************@127.0.0.1>, Duncan Booth wrote:

It is generally a principle of Python that new releases maintain backward
compatibility. An incompatible change such as the one proposed here would
probably break many tests for a large number of people.

Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;

I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.

Sep 25 '06 #24

Jon Ribbens

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

There's nothing to say that cgi.escape should take them both into account
in the one function

so what exactly are you using cgi.escape for in your code ?

To escape characters so that they will be treated as character data
and not control characters in HTML.

What precisely do you think it would "break"?

existing code, and existing tests.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Sep 25 '06 #25

Fredrik Lundh

Max M wrote:

It also makes the escaped html harder to read for standard cases.

and slows things down a bit.

(cgi.escape(s, True) is slower than cgi.escape(s), for reasons that are
obvious for anyone who's looked at the code).

</F>

Sep 25 '06 #26

Georg Brandl

Jon Ribbens wrote:

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

There's nothing to say that cgi.escape should take them both into account
in the one function

so what exactly are you using cgi.escape for in your code ?

To escape characters so that they will be treated as character data
and not control characters in HTML.

What precisely do you think it would "break"?

existing code, and existing tests.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Georg

Sep 25 '06 #27

Duncan Booth

Jon Ribbens <jo********@unequivocal.co.uk> wrote:

In article <Xn*************************@127.0.0.1>, Duncan Booth
wrote:
It is generally a principle of Python that new releases maintain
backward compatibility. An incompatible change such as the one proposed
here would probably break many tests for a large number of people.

Why is the suggested change incompatible? What code would it break?
I agree that it would be a bad idea if it did indeed break backwards
compatibility - but it doesn't.

I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.
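
A minimal sketch of the kind of test being described, assuming Python 3's html.escape as a stand-in: quote=False matches the current cgi.escape default, and quote=True plays the part of the proposed new default. The render functions and the expected string are hypothetical.

```python
import html

def render(comment):
    # Page code written against the current default (quotes left alone).
    return '<p>%s</p>' % html.escape(comment, quote=False)

def render_after_change(comment):
    # The same page code if the default started escaping quotes.
    return '<p>%s</p>' % html.escape(comment, quote=True)

# "Golden" output recorded by an existing test.
EXPECTED = '<p>the "quick" &amp; &lt;brown&gt; fox</p>'

source = 'the "quick" & <brown> fox'
print(render(source) == EXPECTED)               # test passes today
print(render_after_change(source) == EXPECTED)  # test would start failing
```

Both pages display identically in a browser; only the byte-for-byte comparison changes.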

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html
page;

I disagree. I think that doing it in one is muddled thinking and
liable to lead to bugs. Why not keep your output as unicode until it
is ready to be output to the browser, and encode it as appropriate
then? Character encoding and character escaping are separate jobs with
separate requirements that are better off handled by separate code.

Sorry, convert into something I can safely insert wasn't meant to imply
encoding: just entity escaping.

To be clear:

I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.

The resulting string might be a byte string or it might still be unicode:
the point being that the conversion I want is from unescaped to entity
escaped, not from unicode to byte encoded. Right now the only way the
Python library gives me to do the entity escaping properly has a side
effect of encoding the string. I should be able to do the escaping without
having to encode the string at the same time.
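
One way the requested operation could look, as a sketch only: escape the markup characters and turn non-ASCII characters into numeric references, while returning text rather than bytes. This assumes Python 3 and html.escape; the helper name escape_to_ascii_text is hypothetical and not part of any library.

```python
import html

def escape_to_ascii_text(s):
    # Escape &, < and > first, then replace each non-ASCII character
    # with a decimal character reference. The result is still a str.
    escaped = html.escape(s, quote=False)
    return ''.join(c if ord(c) < 128 else '&#%d;' % ord(c) for c in escaped)

result = escape_to_ascii_text('\u201d & <')
print(result)
print(type(result).__name__)
```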

Sep 25 '06 #28

Jon Ribbens

In article <ef**********@news.albasani.net>, Georg Brandl wrote:

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Sorry, that's still not good enough. Why would any code expect such a
thing?

Sep 25 '06 #29

Max M

Jon Ribbens wrote:

In article <ma**************************************@python.org>, Fredrik Lundh wrote:

There's nothing to say that cgi.escape should take them both into account
in the one function
so what exactly are you using cgi.escape for in your code ?

To escape characters so that they will be treated as character data
and not control characters in HTML.

What precisely do you think it would "break"?
existing code, and existing tests.

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Some examples are:

- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.

- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.

- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )

Any change in Python that has these consequences will rightfully be
considered a bug. So what you are suggesting is to knowingly introduce a
bug in the standard library!
You are right that the html generated by cgi.escape() would (probably)
have the same visual appearance in the browsers. But that is a *very*
narrow definition of being bug free and not breaking stuff.

If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!
Max M

Sep 25 '06 #30

Jon Ribbens

In article <Xn*************************@127.0.0.1>, Duncan Booth wrote:

I guess you've never seen anyone write tests which retrieve some generated
html and compare it against the expected value. If the page contains any
unescaped quotes then this change would break it.

You're right - I've never seen anyone do such a thing. It sounds like
a highly dubious and very fragile sort of test to me, of very limited
use.

I'm talking about encoding certain characters as entity references. It
doesn't matter whether it's the character ampersand or right double quote,
they both want to be converted to entities. Same operation.

This is that muddled thinking I was talking about. They are *not* the
same operation. You want to encode "<", for example, because it must
always be encoded to prevent it being treated as an HTML control
character. This has nothing to do with character encodings.

You might sometimes want to escape "right double quote" because it may
or may not be available in the character encoding you are using to output
to the browser. Yes, this might sometimes seem a bit similar to the
"<" escaping described above, because one of the ways you could avoid
the character encoding issue would be to use numeric entities, but it
is actually a completely separate issue and is none of the business of
cgi.escape.

By your argument, cgi.escape should in fact escape *every single*
character as a numeric entity, and even that wouldn't work properly
since "&", "#", ";" and the digits might not be in their usual
positions in the output encoding.

Right now the only way the Python library gives me to do the entity
escaping properly has a side effect of encoding the string. I should
be able to do the escaping without having to encode the string at
the same time.

I'm getting lost here - the opposite of what you say above is true.
cgi.escape does the escaping properly (modulo failing to escape
quotes) without encoding.

Sep 25 '06 #31

Max M

Jon Ribbens wrote:

In article <ef**********@news.albasani.net>, Georg Brandl wrote:

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?
Is that so hard to see? If cgi.escape replaced "'" with an entity reference,
code that expects it not to do so would break.

Sorry, that's still not good enough. Why would any code expect such a
thing?

Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?
This function, which is correct by current documentation, will be broken
by your change.

def hasSomeWord(someword):
    import urllib
    # urllib.urlopen is the Python 2 call; urllib.open does not exist
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    return '"%s"' % someword in content

You might think that it is stupid code that should be changed to take
escaped quotes into account. But that is really not your business to
decide if the other behaviour is documented and correct.

I find it amazing that you cannot understand this. I will stop replying
in this thread now.

Max M

Sep 25 '06 #32
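
The breakage described in the post above can be reproduced in a few lines. This is only a sketch under stated assumptions: it uses Python 3's html.escape as a stand-in for cgi.escape (its quote flag plays the role of the second argument, although it also escapes apostrophes), and the helper name has_quoted_word and the sample text are made up for illustration.

```python
import html


def has_quoted_word(word, content):
    """Return True if `word` appears inside double quotes in `content`."""
    return '"%s"' % word in content


text = 'the "quick" & <brown> fox'

# Stand-in for cgi.escape(text): double quotes are left alone.
unquoted = html.escape(text, quote=False)
# Stand-in for cgi.escape(text, True): double quotes become &quot;.
quoted = html.escape(text, quote=True)

print(unquoted)
print(quoted)
print(has_quoted_word("quick", unquoted))
print(has_quoted_word("quick", quoted))
```

The same plain-text search succeeds on the first string and fails on the second, which is the behaviour change the two sides of this thread are arguing about.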

Jon Ribbens

In article <45***********************@dread15.news.tele.dk> , Max M wrote:

>I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Some examples are:

- Possibly any code that tests for string equality in a rendered
html/xml page. Testing is a preferred development tool these days.

Testing is good, but only if done correctly.

- Code that generates cgi.escaped() markup and (rightfully) for some
reason expects the old behaviour to be used.

That's begging the question again ("an example of code that would
break is code that would break").

- 3. party code that parses/scrapes content from cgi.escaped() markup.
(you could even break Java code this way :-s )

I'm sorry, I don't understand that one. What is "party code"? Code
that is scraping content from web sites already has to cope with
entities etc.

Your comment about Java is a little ironic given that I persuaded the
Java Struts people to make the exact same change we're talking about
here, back in 2002 (even if it did take 11 months) ;-)

If you cannot think of other examples for yourself where your change
would introduce breakage, you are certainly not an experienced enough
programmer to suggest changes in the standard lib!

I'll take my own opinion on that over yours, thanks.

Sep 25 '06 #33

and-google

Jon Ribbens wrote:

I'm sorry, that's not good enough. How, precisely, would it break
"existing code"?

('owdo Mr. Ribbens!)

It's possible there could be software that relies on ' not being
escaped, for example:

# Auto-markup links to O'Reilly, everyone's favourite
# example name with an apostrophe in it
#
URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)

Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.

A similar change to str.encode('unicode-escape') in Python 2.5 caused a
number of similar subtle problems. (In this case the old documentation
was a bit woolly so didn't prescribe the exact older behaviour.)

I'm not saying that the cgi.escape interface is *good*, just that it's
too late to change it.

I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)

--
And Clover
mailto:an*@doxdesk.com
http://www.doxdesk.com/

Sep 25 '06 #34

Jon Ribbens

In article <45***********************@dread15.news.tele.dk> , Max M wrote:

Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?

No, but if nobody else can find one either, that's a clue that maybe
it's safe to change.

Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to. Even by your own argument, therefore, code is not
entitled to rely on the output of cgi.escape being any particular
exact string.

This function, which is correct by the current documentation, will be
broken by your change.

def hasSomeWord(someword):
    import urllib
    # urllib.urlopen is the Python 2 call for fetching a URL
    f = urllib.urlopen('http://www.example.com/cgi_escaped_content')
    content = f.read()
    f.close()
    return '"%s"' % someword in content

That function is broken already, no change required.
I find it amazing that you cannot understand this.

Sep 25 '06 #35

Duncan Booth

Jon Ribbens <jo********@unequivocal.co.uk> wrote:

In article <ef**********@news.albasani.net>, Georg Brandl wrote:

>>I'm sorry, that's not good enough. How, precisely, would it break
"existing code"? Can you come up with an example, or even an
explanation of how it *could* break existing code?

Is that so hard to see? If cgi.escape replaced "'" with an entity
reference, code that expects it not to do so would break.

Sorry, that's still not good enough. Why would any code expect such a
thing?

It's easy enough to come up with examples which might. For example, I
have doctests which evaluate tal expressions. I don't think I currently
have any which depend on quotes, but I can easily create one (I just
did, and it passes):

>>> print T('''<tal:x tal:content="python:'It\\'s a \\x22tal\\x22 string'" />''')
It's a "tal" string
>>> print T('''<x tal:attributes="title python:'It\\'s a \\x22tal\\x22 string'" />''')
<x title="It's a &quot;tal&quot; string" />

More likely I might output a field value and just happen to have used a quote
in it.

FWIW, in zope tal, the value of tal:content is escaped using the equivalent of
cgi.escape(s, False), and attribute values are escaped using
cgi.escape(s, True).

The function T I use is defined as:

def T(template, **kw):
    """Create and render a page template."""
    pt = PageTemplate()
    pt.pt_edit(template, 'text/html')
    return pt.pt_render(extra_context=kw).strip('\n')

Sep 25 '06 #36
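
The split described above, one escaping rule for element content and another for attribute values, can be sketched without Zope. This is an illustration only, not how TAL is implemented: render_element is a hypothetical helper, and Python 3's html.escape stands in for cgi.escape (unlike cgi.escape(s, True), it also turns apostrophes into &#x27;).

```python
import html


def render_element(tag, content, **attrs):
    """Render one element, escaping content and attribute values differently."""
    rendered_attrs = "".join(
        # Attribute values sit inside double quotes, so quotes must be escaped.
        ' %s="%s"' % (name, html.escape(value, quote=True))
        for name, value in sorted(attrs.items())
    )
    # Element content only needs &, < and > escaped.
    body = html.escape(content, quote=False)
    return "<%s%s>%s</%s>" % (tag, rendered_attrs, body, tag)


value = 'It\'s a "tal" string'
print(render_element("x", value, title=value))
```

The quotes survive untouched in the element body but are escaped inside the title attribute, which mirrors the two doctest outputs shown in the post.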

Fredrik Lundh

Jon Ribbens wrote:

Sorry, that's still not good enough.

that's not up to you to decide, though.

</F>

Sep 25 '06 #37

Jon Ribbens

In article <11**********************@i42g2000cwa.googlegroups .com>, an********@doxdesk.com wrote:

>I'm sorry, that's not good enough. How, precisely, would it break
"existing code"?

('owdo Mr. Ribbens!)

Good afternoon Mr Glover ;-)

URI= 'http://www.oreilly.com/'
html= cgi.escape(text)
html= html.replace('O\'Reilly', '<a href="%s">O\'Reilly</a>' % URI)

Sure this may be rare, but it's what the documentation says, and
changing it may not only fix things but also subtly break things in
ways that are hard to detect.

I'm not sure about "subtly break things", but you're right that the
above code would break. I could argue that it's broken already,
(since it's doing a plain-text search on HTML data) but given
real-world considerations it's reasonable enough that I won't be that
pedantic ;-)

I personally think the entire function should be deprecated, firstly
because it's insufficient in some corner cases (apostrophes as you
pointed out, and XHTML CDATA), and secondly because it's in the wrong
place: HTML-escaping is nothing to do with the CGI interface. A good
template library should deal with escaping more smoothly and correctly
than cgi.escape. (It may be able to deal with escape-or-not-bother and
character encoding issues automatically, for example.)

I agree that in most situations you should probably be using a
template library, but sometimes a simple CGI-and-manual-HTML system
suffices, and I think (a fixed version of) cgi.escape should exist at
a low level of the web application stack.

Sep 25 '06 #38
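
One way to sidestep the "plain-text search on HTML data" problem raised above is to search the plain text first and escape the pieces afterwards. The following is a minimal sketch, assuming Python 3's html.escape in place of cgi.escape; link_name and NAME are hypothetical names introduced for the example.

```python
import html

URI = "http://www.oreilly.com/"
NAME = "O'Reilly"


def link_name(text):
    """Escape `text` for HTML and wrap each occurrence of NAME in a link."""
    link = '<a href="%s">%s</a>' % (html.escape(URI, quote=True), html.escape(NAME))
    # Split the plain text first, then escape each piece, so the search
    # never depends on how the escaping function treats apostrophes.
    return link.join(html.escape(piece) for piece in text.split(NAME))


print(link_name("Books by O'Reilly & co"))
```

Because the search runs before any escaping, the result is the same whether or not the escaping function converts apostrophes.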

Filip Salomonsson

On 25 Sep 2006 15:13:30 GMT, Jon Ribbens <jo********@unequivocal.co.uk> wrote:

>
Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to.

If the documentation isn't clear enough, that means the documentation
should be fixed.

It does _not_ mean "you are free to introduce new behavior because
nobody should trust what this function does anyway".
--
filip salomonsson

Sep 25 '06 #39

Jon Ribbens

In article <ma**************************************@python.o rg>, Fredrik Lundh wrote:

>Sorry, that's still not good enough.

that's not up to you to decide, though.

It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.

Sep 25 '06 #40

Jon Ribbens

In article <ma**************************************@python.o rg>, Filip Salomonsson wrote:

>Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to.

If the documentation isn't clear enough, that means the documentation
should be fixed.

Incorrect - documentation can and frequently does leave certain
behaviours undefined. This is deliberate and (among other things)
is to allow for the behaviour to change in future versions without
breaking backwards-compatibility.

Sep 25 '06 #41

Fredrik Lundh

Jon Ribbens wrote:

It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.

not if you expect anyone to take anything you say seriously.

</F>

Sep 25 '06 #42

Jon Ribbens

In article <ma**************************************@python.o rg>, Fredrik Lundh wrote:

>It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.

not if you expect anyone to take anything you say seriously.

Now you're just being ridiculous. In this thread you have been rude,
evasive, insulting, vague, hypocritical, and have failed to answer
substantive points in favour of sarcastic and erroneous sniping - I'd
suggest it's you that needs to worry about being taken seriously.

Sep 25 '06 #43

Brian Quinlan

Jon Ribbens wrote:

In article <ma**************************************@python.o rg>, Fredrik Lundh wrote:

>>It's up to me to decide whether or not an argument is good enough to
convince me, thank you very much.
not if you expect anyone to take anything you say seriously.

Now you're just being ridiculous. In this thread you have been rude,
evasive, insulting, vague, hypocritical, and have failed to answer
substantive points in favour of sarcastic and erroneous sniping - I'd
suggest it's you that needs to worry about being taken seriously.

Actually, at least in the context of this mailing list, Fredrik doesn't
have to worry about that at all. Why? Because he is one of the most
prolific contributors to the Python language and libraries and his
contributions have been of consistent high quality.

You, on the other hand, are "just some guy" and people don't have a lot
of incentive to convince you of anything.

I have no opinion on the actual debate though. Just trying to help with
the social analysis :-)

Cheers,
Brian

Sep 25 '06 #44

Jon Ribbens

In article <ma**************************************@python.o rg>, Brian Quinlan wrote:

>Now you're just being ridiculous. In this thread you have been rude,
evasive, insulting, vague, hypocritical, and have failed to answer
substantive points in favour of sarcastic and erroneous sniping - I'd
suggest it's you that needs to worry about being taken seriously.

Actually, at least in the context of this mailing list, Fredrik doesn't
have to worry about that at all. Why? Because he is one of the most
prolific contributors to the Python language and libraries

I would have hoped that people don't treat that as a licence to be
obnoxious, though. I am aware of Fredrik's history, which is why I
was somewhat surprised and disappointed that he was being so rude
and unpleasant in this thread. He is not living up to his reputation
at all. Maybe he's having a bad day ;-)

Sep 25 '06 #45

Georg Brandl

Jon Ribbens wrote:

In article <45***********************@dread15.news.tele.dk> , Max M wrote:
>Oh ... because you cannot see a use case for that *documented*
behaviour, it must certainly be wrong?

No, but if nobody else can find one either, that's a clue that maybe
it's safe to change.

Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to.

It says "to HTML-safe sequences". That's reasonably clear without the need
to reproduce the exact replacements for each character.

If anyone doesn't know what is meant by this, he shouldn't really write apps
using the cgi module before doing a basic HTML course.

Or use the source.

Georg

Sep 25 '06 #46

Dan Bishop

Fredrik Lundh wrote:

Jon Ribbens wrote:

Making cgi.escape always escape the '"' character would not break
anything, and would probably fix a few bugs in existing code. Yes,
those bugs are not cgi.escape's fault, but that's no reason not to
be helpful. It's a minor improvement with no downside.

the "improvement with no downside" would bloat down the output for
everyone who's using the function in the intended way,

"Unless" "your" "CGI" "scripts" "output" "text" "like" "this," "I"
"think" "it's" "absurd" "to" "consider" "the" "bloat" "significant."

Sep 25 '06 #47
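
The size cost being argued about is easy to measure. As a rough sketch, assuming Python 3's html.escape as a stand-in for cgi.escape and a hypothetical helper named quote_overhead, each double quote grows from one character to the six characters of &quot;:

```python
import html


def quote_overhead(text):
    """Return how many extra characters escaping the quotes adds."""
    plain = html.escape(text, quote=False)
    quoted = html.escape(text, quote=True)
    return len(quoted) - len(plain)


print(quote_overhead('"Unless" "your" "CGI"'))
print(quote_overhead("no quotes & <b>"))
```

Text without quote characters costs nothing extra; the overhead is five characters per quote. Note that html.escape also escapes apostrophes when quote is true, so they count towards the total too.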

Jon Ribbens

In article <ef**********@news.albasani.net>, Georg Brandl wrote:

>Here's a point for you - the documentation for cgi.escape says that
the characters "&", "<" and ">" are converted, but not what they are
converted to.

It says "to HTML-safe sequences". That's reasonably clear without the need
to reproduce the exact replacements for each character.

If anyone doesn't know what is meant by this, he shouldn't really write apps
using the cgi module before doing a basic HTML course.

So would you like to explain the difference between &quot; and &#34; ,
or do you need to go on a "basic HTML course" first?

Sep 25 '06 #48

Lawrence D'Oliveiro

In message <Xn*************************@127.0.0.1>, Duncan Booth wrote:

If I have a unicode string such as: u'\u201d' (right double quote), then I
want that encoded in my html as '&#8221;' (or &rdquo; but the numeric form
is better).

Right-double-quote is not an HTML special, so there's no need to quote it.
I'm only concerned here with characters that have special meanings in HTML
markup.

There should be a one-stop shop where I can take my unicode text and
convert it into something I can safely insert into a generated html page;
at present I need to call both cgi.escape and s.encode to get the desired
effect.

What you're really asking for is a version of cgi.escape that a) fixes the
bugs discussed in this thread, and b) copes with different encodings while
doing so.

To handle b), you would need to pass it some indication of what the encoding
of the string is. In any case, converting a literal right-double-quote to
&#8221; is not relevant to the purpose of cgi.escape.

Sep 26 '06 #49
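
The "one-stop shop" asked for above amounts to two steps chained together: escape the markup characters, then encode with a fallback to numeric character references. A minimal sketch, assuming Python 3's html.escape in place of cgi.escape and a hypothetical helper named html_bytes:

```python
import html


def html_bytes(text, encoding="ascii"):
    """Escape HTML specials, then encode for output to the browser."""
    # Step 1: markup escaping, independent of any output encoding.
    escaped = html.escape(text, quote=True)
    # Step 2: characters the encoding cannot represent become &#NNNN;.
    return escaped.encode(encoding, "xmlcharrefreplace")


print(html_bytes('\u201d & "x"'))
print(html_bytes('\u201d', "utf-8"))
```

The order matters: escaping has to happen first, otherwise the ampersands of the generated character references would themselves need protecting. With an encoding that can represent the character, such as UTF-8, no reference is produced at all.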

Lawrence D'Oliveiro

In message <45***********************@dread15.news.tele.dk> , Max M wrote:

Lawrence is right that the escape method doesn't work the way he expects
it to.

Rewriting a library module simply because a developer is surprised is a
*very* bad idea.

I'm not surprised. Disappointed, yes. Verging on disgust at some comments in
this thread, yes. But "surprised" is what a lot of users of the existing
cgi.escape function are going to be when they discover their code isn't
doing what they thought it was.

It would break just about every web app out there that
uses the escape module...

How will it break them? Give an example.

Sep 26 '06 #50
