Is there something available that will parse the "netloc" field as
returned by urlparse, including all the hard cases? The "netloc" field
can potentially contain a port number and a numeric IP address. The
IP address may take many forms, including an IPv6 address.
I'm parsing URLs used by hostile sites, and the weird cases come up
all too frequently.
John Nagle
On 7/22/07, John Nagle wrote:
> Is there something available that will parse the "netloc" field as
> returned by urlparse, including all the hard cases? The "netloc" field
> can potentially contain a port number and a numeric IP address. The
> IP address may take many forms, including an IPv6 address.
What do you mean by "parse" the field? What do you want to get back
from the parser function?
On 22 Jul, 18:56, John Nagle <na...@animats.com> wrote:
> Is there something available that will parse the "netloc" field as
> returned by urlparse, including all the hard cases? The "netloc" field
> can potentially contain a port number and a numeric IP address. The
> IP address may take many forms, including an IPv6 address.
> I'm parsing URLs used by hostile sites, and the weird cases come up
> all too frequently.
I assume that when you say "netloc" you are referring to the second
field returned by the urlparse module. If this netloc contains an IPv6
address then it will also contain square brackets. The colons inside
the [] belong to the IPv6 address and the single possible colon
outside the brackets belongs to the port number. Of course, you might
want to try to help people who do not follow the RFCs and failed to
wrap the IPv6 address in square brackets. In that case, try...except
comes in handy. You can try to parse an IPv6 address and, if it fails
because of too many segments, fall back to some other behaviour.
The worst case is a URL like http://2001::123:4567:abcd:8080/something.
Does the 8080 refer to a port number or to part of the IPv6 address? If I
had to support non-bracketed IPv6 addresses, then I would interpret
this as http://[2001::123:4567:abcd]:8080/something.
RFC3986 is the reference for correct URL formats.
Once you eliminate IPv6 addresses, parsing is simple. Is there a
colon? Then there is a port number. Does the left over have any
characters not in [0123456789.]? Then it is a name, not an IPv4
address.
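The bracket and colon rules above can be sketched roughly as follows.
This is only an illustration under stated assumptions: split_hostport is
a hypothetical helper name, not part of urlparse, it throws away any
"user:password@" prefix, and it leaves the unbracketed multi-colon case
undecided instead of guessing which colon starts the port.

```python
def split_hostport(netloc):
    """Split a netloc into (host, port) using the bracket rule.

    Hypothetical helper: userinfo is discarded, and an unbracketed host
    with more than one colon is returned unchanged with port None.
    """
    hostport = netloc.rpartition('@')[2]      # drop optional userinfo
    if hostport.startswith('['):
        # Bracketed IPv6 literal: colons inside [] belong to the address.
        end = hostport.find(']')
        if end == -1:
            raise ValueError('unterminated IPv6 literal: %r' % netloc)
        host, rest = hostport[1:end], hostport[end + 1:]
        if rest and not rest.startswith(':'):
            raise ValueError('junk after IPv6 literal: %r' % netloc)
        port = rest[1:]
    elif hostport.count(':') == 1:
        # Exactly one colon outside brackets: it introduces the port.
        host, _, port = hostport.partition(':')
    else:
        # No colon (plain host) or several (unbracketed IPv6, ambiguous).
        host, port = hostport, ''
    if any(c not in '0123456789' for c in port):
        raise ValueError('non-numeric port: %r' % netloc)
    return host, (int(port) if port else None)
```

A caller that wants the "last segment is the port" reading for
unbracketed IPv6 can apply it on top of this, once the address part has
failed to parse on its own.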
--Michael Dillon
me******@yahoo.com wrote:
> Once you eliminate IPv6 addresses, parsing is simple. Is there a
> colon? Then there is a port number. Does the left over have any
> characters not in [0123456789.]? Then it is a name, not an IPv4
> address.
> --Michael Dillon
You wish. Hex input of IP addresses is allowed:
    http://0x525eedda
and
    http://0x52.0x5e.0xed.0xda
are both "Python.org". Or just put
    0x52.0x5e.0xed.0xda
into the address bar of a browser. All of these work in Firefox on
Windows and are recognized as valid IP addresses.
On the other hand,
    0x52.com
is a valid domain name, in use by PairNIC. But
    http://test.0xda
is handled by Firefox on Windows as a domain name. It doesn't resolve,
but it is sent to DNS.
So I think the question is whether every term between dots can be parsed as
a decimal or hex number. If all terms can be parsed as a number, and there are
no more than four of them, it's an IP address. Otherwise it's a domain name.
There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites. So I really do need to get the hard cases right.
Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?
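A minimal sketch of the rule proposed above, purely syntactic and with no
DNS lookup. The name looks_like_ipv4 is made up for this example, and the
check makes no attempt at range validation or at telling octal terms
apart from decimal ones, so treat it as a starting point only.

```python
def looks_like_ipv4(host):
    """True if host is 1 to 4 dot-separated terms, each decimal or hex."""
    terms = host.split('.')
    if not 1 <= len(terms) <= 4:
        return False
    for term in terms:
        digits = term.lower()
        if digits.startswith('0x'):
            # Hex term: everything after the prefix must be a hex digit.
            digits, allowed = digits[2:], '0123456789abcdef'
        else:
            allowed = '0123456789'
        if not digits or any(c not in allowed for c in digits):
            return False
    return True
```

With this rule 0x525eedda and 0x52.0x5e.0xed.0xda count as addresses,
while 0x52.com and test.0xda are left as domain names.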
John Nagle
Here's another hard case. This one might be a bug in urlparse:
    import urlparse
    s = 'ftp://administrator:pa******@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'
    urlparse.urlparse(s)
yields:
    (u'ftp', u'administrator:pa******@64.105.135.30',
     u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')
That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.
That's a real URL, from a search for phishing sites. There are lots
of hostile URLs out there, some of which can fool some parsers.
John Nagle
On 7/23/07, John Nagle wrote:
> Here's another hard case. This one might be a bug in urlparse:
>     import urlparse
>     s = 'ftp://administrator:pa******@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'
>     urlparse.urlparse(s)
> yields:
>     (u'ftp', u'administrator:pa******@64.105.135.30',
>      u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')
> That second field is supposed to be the "hostport" (per the RFC usage
> of the term; Python uses the term "netloc"), and the username/password
> should have been parsed and moved to the "username" and "password" fields
> of the object. So it looks like urlparse doesn't really understand FTP URLs.
Those values aren't "moved" to the fields; they're extracted on the
fly from the netloc. Use the .hostname property of the result tuple
to get just the hostname.
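As a quick illustration of those properties, here is a sketch using
urllib.parse, which is where the urlparse module lives in Python 3. The
URL and credentials are invented for the example (192.0.2.10 is a
documentation-only address), not taken from the phishing list.

```python
from urllib.parse import urlsplit  # Python 3 home of the old urlparse module

# Hypothetical URL, for illustration only.
parts = urlsplit('ftp://user:secret@192.0.2.10:2121/originals/login.html')

print(parts.netloc)    # raw field, still includes userinfo and port
print(parts.username)  # extracted on the fly from netloc
print(parts.password)
print(parts.hostname)  # host only, lowercased, without brackets or port
print(parts.port)      # int, or None when no port is given

# A bracketed IPv6 host is handled by the same properties.
v6 = urlsplit('http://[2001:db8::1]:8080/something')
print(v6.hostname, v6.port)
```

The netloc string itself is never rewritten; the properties simply
re-parse it each time they are read.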
-Miles
On 7/22/07, John Nagle wrote:
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?
    import re

    NETLOC_RE = re.compile(r'''^       # start of string
        (?:([^@]+)@)?                  # 1: optional userinfo
        (?:\[([0-9a-fA-F:]+)\]|        # 2: IPv6 addr
           ([^\[\]:]+))                # 3: IPv4 addr or reg-name
        (?::(\d+))?                    # 4: optional port
        $''', re.VERBOSE)              # end of string

    def normalize_IPv4(netloc):
        try:  # Assume it's an IP; if it's not, catch the error and return None
            host = NETLOC_RE.match(netloc).group(3)
            # Base 0 accepts decimal and 0x-prefixed hex terms. On Python 3
            # a leading-zero octal term such as 010 raises ValueError.
            octets = [int(o, 0) for o in host.split('.')]
            assert len(octets) <= 4
            # Expand short forms: the last term fills all remaining bytes
            for i in range(len(octets), 4):
                octets[i-1:] = divmod(octets[i-1], 256**(4-i))
            for o in octets:
                assert 0 <= o < 256
            host = '.'.join(str(o) for o in octets)
        except (AssertionError, ValueError, AttributeError):
            return None
        return host

    def is_ip(netloc):
        if normalize_IPv4(netloc) is None:
            match = NETLOC_RE.match(netloc)
            # IPv6 validation could be stricter
            return bool(match and match.group(2))
        return True
The first function, I'd imagine, is the more interesting of the two.
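On the "IPv6 validation could be stricter" comment: one possible way to
tighten it, sketched here on the assumption that the ipaddress module
from the Python 3 standard library is available, is to hand the text
found between the brackets to ipaddress.IPv6Address. The function name
is_ipv6_literal is made up for this example.

```python
import ipaddress

def is_ipv6_literal(text):
    """True if text (without the surrounding brackets) is a valid IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True
```

This rejects strings such as ":::" that a pattern of hex digits and
colons lets through.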
-Miles
On 7/23/07, Miles wrote:
Apparently this will generally work as well:
    import re, socket

    NETLOC_RE = ...  # same pattern as above

    def normalize_IPv4(netloc):
        try:
            host = NETLOC_RE.match(netloc).group(3)
            return socket.inet_ntoa(socket.inet_aton(host))
        # TypeError covers a bracketed IPv6 netloc, where group(3) is None
        except (AttributeError, TypeError, socket.error):
            return None
Thanks to http://mail.python.org/pipermail/pyt...ly/450317.html
-Miles