urlparse.urlparse bug - misparses long URL

Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
====
http://ww***************************...YLYpzDGOFkLZyY
====
What we get back in the "accesshost" field (i.e. the domain name) is

====
'ww*******************************************@Tez zaron.com&xUDysvTbzZZOaymjQ2oYIx2AvMdJ1WQfjP02wIBB QBb1EVZAqmmGunxrcyGx1AcfegWUUYtaZfRW434O5Qn6InSMUZ XgF5e3KzJbCntBGOj7pv31zab&action=login-run&passkey=e84239c9da59dbeb61d4d45db2cc5840&info_ hash=%c9q%be%fe%c6j%ca%fd0%18%fe%23J%bd%89%d3%06L% fdV&info_hash=%18%9d%fb%15v%c0A%1f%c8%dds%0f%17%99 %ceQ%83%a0%3e%27&info_hash=%df%f0%1c%5e%d75%b2%7d% e6D%0d%3e%d8%fbZ%5c%de%2ae%93&https:'
====

which is wrong. Something far out in that URL is breaking urlparse, and it's
not able to extract the domain name properly.

It's not a Unicode issue; I forced the data to "str" and it still mis-parses.

I'm trying to construct a shorter string that fails. More to follow.

(Yes, another error associated with the wonderful world of parsing hostile sites
in Python. This is from a phishing attack, and that URL is in PhishTank.)

John Nagle
SiteTruth
Dec 14 '07 #1
John Nagle wrote:
Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
....
It's breaking on the first slash after the leading "//", which just happens to
be very late in the URL. The "?" is not treated as the end of the host:
>>> urlparse('http://example.com?blahblah=http://example.net')
('http', 'example.com?blahblah=http:', '//example.net', '', '', '')
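The difference between cutting the host at the first slash and cutting it at the earliest delimiter can be sketched as follows. This is an illustration only: both helper names are made up for this example and neither is the urlparse source. `rest` is assumed to be the URL with the "http:" scheme already removed.

```python
# Illustration only: these helpers are hypothetical, not the urlparse source.
# `rest` is the URL with the "http:" scheme already stripped off.

def netloc_first_slash(rest):
    """End the host at the first '/' after the leading '//'."""
    end = rest.find('/', 2)
    if end < 0:
        end = len(rest)
    return rest[2:end], rest[end:]


def netloc_earliest_delimiter(rest):
    """End the host at whichever of '/', '?' or '#' comes first."""
    end = len(rest)
    for delim in '/?#':
        pos = rest.find(delim, 2)
        if pos >= 0:
            end = min(end, pos)
    return rest[2:end], rest[end:]


rest = '//example.com?blahblah=http://example.net'
print(netloc_first_slash(rest))
# ('example.com?blahblah=http:', '//example.net')
print(netloc_earliest_delimiter(rest))
# ('example.com', '?blahblah=http://example.net')
```

The first helper reproduces the host and path fields shown above; the second stops at the "?" and leaves the query string intact.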
Dec 14 '07 #2
John Nagle wrote:
Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
....
Simpler example:

import urlparse
s = 'http://www.example.com.mx?https://www.example.com'
print urlparse.urlparse(s)

produces
('http', 'www.example.com.mx?https:', '//www.example.com', '', '', '')

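which is wrong: the domain should be 'www.example.com.mx', and everything
after the '?' belongs in the query field.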
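A quick way to check whether a particular interpreter shows this behaviour is to compare the parsed fields with the expected ones. The sketch below is not part of the original report: the function names `parse_host_and_query` and `is_affected` are made up for illustration, and the import falls back between the Python 3 location of the module (`urllib.parse`) and the Python 2 one (`urlparse`).

```python
try:
    from urllib.parse import urlsplit   # Python 3 location
except ImportError:
    from urlparse import urlsplit       # Python 2 location

TEST_URL = 'http://www.example.com.mx?https://www.example.com'


def parse_host_and_query(url=TEST_URL):
    """Return the (netloc, query) fields of the split URL."""
    parts = urlsplit(url)
    # Index access works even where the result is a plain tuple.
    return parts[1], parts[3]


def is_affected():
    """True if the host field swallows the query string (hypothetical check)."""
    expected = ('www.example.com.mx', 'https://www.example.com')
    return parse_host_and_query() != expected


print(parse_host_and_query())
print(is_affected())
```

An interpreter that exhibits the problem would report a host of 'www.example.com.mx?https:' and an empty query, so `is_affected()` would return True there.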

John Nagle
SiteTruth
Dec 14 '07 #3
John Nagle wrote:
Here's a hostile URL that "urlparse.urlparse" seems to have mis-parsed.
....
Added to tracker, with proposed fix:

http://bugs.python.org/issue1637

John Nagle
Dec 17 '07 #4
