Unexpected exception from socket.getaddrinfo on Unicode URL

Here's a strange little bug. "socket.getaddrinfo" blows up
if given a bad domain name containing ".." in Unicode. The
same string in ASCII produces the correct "gaierror" exception.

Actually, this deserves a documentation mention. The "socket" module,
given a Unicode string, calls the International Domain Name parser,
"idna.py", which has a whole error system of its own. The IDNA
documentation says that "Furthermore, the socket module transparently converts
Unicode host names to ACE, so that applications need not be concerned about
converting host names themselves when they pass them to the socket module."
However, that's not quite true; the IDNA rules say that syntax errors must
be treated as errors, so you have to be prepared for IDNA exceptions.
They are all "UnicodeError" exceptions.
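
For anyone reading this on Python 3, where every str is Unicode and so every host name goes through the IDNA codec, here is a minimal sketch of what "being prepared" might look like. The helper name resolve_or_none is hypothetical and not part of the socket module, and it assumes you want a malformed name and a failed lookup handled the same way.

```python
import socket

def resolve_or_none(host, port):
    """Return getaddrinfo() results, or None if the name is bad or unresolvable.

    Hypothetical helper for illustration; not part of the socket module.
    """
    try:
        return socket.getaddrinfo(host, port)
    except socket.gaierror:
        # The resolver rejected the name or could not find it.
        return None
    except UnicodeError:
        # The IDNA codec rejected the name before any lookup was attempted,
        # for example because ".." produced an empty label.
        return None

print(resolve_or_none("www.gallery84..com", "http"))
```

The two except clauses are kept separate only to show where each failure comes from; a single `except (socket.gaierror, UnicodeError)` would behave the same.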

It's worth a mention in the documentation for "socket".

John Nagle

D:\>/python25/python.exe
Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> ss = 'www.gallery84..com'
>>> uss = unicode(ss)
>>> import socket
>>> socket.getaddrinfo(ss,"http")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.gaierror: (11001, 'getaddrinfo failed')
>>> socket.getaddrinfo(uss,"http")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "D:\python25\lib\encodings\idna.py", line 164, in encode
    result.append(ToASCII(label))
  File "D:\python25\lib\encodings\idna.py", line 73, in ToASCII
    raise UnicodeError("label empty or too long")
UnicodeError: label empty or too long
>>>
Apr 21 '07 #1
I took a look at the documentation but couldn't see where to add what,
given that the documentation for socket already says:

"""All errors raise exceptions. The normal exceptions for invalid
argument types and out-of-memory conditions can be raised; errors
related to socket or address semantics raise the error socket.error.
""".

Do we really need to specifically mention Unicode errors?

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 21 '07 #2
Steve Holden wrote:
> I took a look at the documentation but couldn't see where to add what,
> given that the documentation for socket already says:
>
> """All errors raise exceptions. The normal exceptions for invalid
> argument types and out-of-memory conditions can be raised; errors
> related to socket or address semantics raise the error socket.error.
> """.
>
> Do we really need to specifically mention Unicode errors?

It says "errors related to socket or address semantics raise the
error 'socket.error'", so, yes. The error really has nothing to
do with Unicode; it's that a different parser is used when a domain
name is in Unicode. It really shouldn't be a "Unicode error" at
all.
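
One way to get the behaviour argued for here, without touching the library, is to translate the codec's exception at the call site. This is only a sketch: the wrapper name getaddrinfo_uniform is hypothetical, and reporting the failure as EAI_NONAME is my assumption about the closest matching resolver error, not something the socket module does.

```python
import socket

def getaddrinfo_uniform(host, port, *args, **kwargs):
    """Like socket.getaddrinfo(), but report IDNA failures as socket.gaierror.

    Hypothetical wrapper for illustration; using EAI_NONAME is an assumption.
    """
    try:
        return socket.getaddrinfo(host, port, *args, **kwargs)
    except UnicodeError as exc:
        # The host name failed IDNA encoding; surface it as an address error
        # so callers only need to handle one exception type.
        raise socket.gaierror(
            socket.EAI_NONAME, "invalid host name: %s" % exc
        ) from exc
```

Callers can then keep a single `except socket.gaierror` handler for both malformed and unresolvable names, and the original UnicodeError stays reachable through `__cause__`.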

When Python goes to Unicode by default, this is likely to break
some existing code. Python's IDNA support is good, but not entirely
invisible. The socket module documentation should mention IDNA
support. It's not clear, for example, when you call "getnameinfo()",
whether you get back the name in Unicode or in Punycode.
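
For reference, the two forms in question can be produced with the standard "idna" codec, as in the short Python 3 sketch below. The host name is a made-up example, and this does not settle what "getnameinfo()" actually returns; it only shows what the Unicode and ACE (Punycode) spellings of one name look like.

```python
# Unicode host name -> ACE (Punycode with the "xn--" prefix), and back.
name = "b\u00fccher.example"        # "bücher.example", a made-up example name
ace = name.encode("idna")           # the form that goes out on the wire
print(ace)                          # b'xn--bcher-kva.example'
print(ace.decode("idna") == name)   # True: decoding restores the Unicode form
```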

John Nagle
Apr 21 '07 #3
