URL parsing for the hard cases

Is there something available that will parse the "netloc" field as
returned by urlparse, including all the hard cases? The "netloc" field
can potentially contain a port number and a numeric IP address. The
IP address may take many forms, including an IPv6 address.

I'm parsing URLs used by hostile sites, and the weird cases come up
all too frequently.

John Nagle
Jul 22 '07 #1
On 7/22/07, John Nagle wrote:
> Is there something available that will parse the "netloc" field as
> returned by urlparse, including all the hard cases? The "netloc" field
> can potentially contain a port number and a numeric IP address. The
> IP address may take many forms, including an IPv6 address.

What do you mean by "parse" the field? What do you want to get back
from the parser function?
Jul 22 '07 #2
On 22 Jul, 18:56, John Nagle <na...@animats.com> wrote:
> Is there something available that will parse the "netloc" field as
> returned by urlparse, including all the hard cases? The "netloc" field
> can potentially contain a port number and a numeric IP address. The
> IP address may take many forms, including an IPv6 address.
>
> I'm parsing URLs used by hostile sites, and the weird cases come up
> all too frequently.

I assume that when you say "netloc" you are referring to the second
field returned by the urlparse module. If this netloc contains an IPv6
address then it will also contain square brackets. The colons inside
the [] belong to the IPv6 address and the single possible colon
outside the brackets belongs to the port number. Of course, you might
want to try to help people who do not follow the RFCs and fail to
wrap the IPv6 address in square brackets. In that case, try...except
comes in handy. You can try to parse an IPv6 address and, if it fails
because of too many segments, fall back to some other behaviour.

The worst case is a URL like http://2001::123:4567:abcd:8080/something.
Does the 8080 refer to a port number or to part of the IPv6 address? If I
had to support non-bracketed IPv6 addresses, then I would interpret
this as http://[2001::123:4567:abcd]:8080/something.
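
The bracket rule described above can be sketched roughly as follows. This is only an illustration in Python 3, not code from the thread: the name split_hostport is hypothetical, the port is returned as text without validation, and the unbracketed branch simply adopts the guess that the last colon-separated group is the port.

```python
def split_hostport(netloc):
    """Split a netloc into (host, port); port is None when absent."""
    # Drop any 'user:password@' prefix; the host part follows the last '@'.
    hostport = netloc.rpartition("@")[2]
    if hostport.startswith("["):
        # Bracketed IPv6 literal: colons inside [] belong to the address.
        host, bracket, rest = hostport[1:].partition("]")
        if not bracket or (rest and not rest.startswith(":")):
            raise ValueError("malformed bracketed host: %r" % netloc)
        return host, rest[1:] or None
    if hostport.count(":") > 1:
        # Unbracketed IPv6 (not RFC-conformant): guess that the last
        # group is the port, as suggested for the ambiguous case above.
        host, _, port = hostport.rpartition(":")
        return host, port
    # Plain name or IPv4 address with at most one colon.
    host, _, port = hostport.partition(":")
    return host, port or None


print(split_hostport("[2001::123:4567:abcd]:8080"))
print(split_hostport("2001::123:4567:abcd:8080"))
print(split_hostport("example.com"))
```

Both IPv6 spellings come out with the same host and port, which is the interpretation proposed above; a stricter tool might instead reject the unbracketed form outright.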

RFC3986 is the reference for correct URL formats.

Once you eliminate IPv6 addresses, parsing is simple. Is there a
colon? Then there is a port number. Does the left over have any
characters not in [0123456789.]? Then it is a name, not an IPv4
address.

--Michael Dillon

Jul 22 '07 #3
me******@yahoo.com wrote:
> Once you eliminate IPv6 addresses, parsing is simple. Is there a
> colon? Then there is a port number. Does the left over have any
> characters not in [0123456789.]? Then it is a name, not an IPv4
> address.
>
> --Michael Dillon

You wish. Hex input of IP addresses is allowed:

http://0x525eedda

and

http://0x52.0x5e.0xed.0xda

are both "Python.org". Or just put

0x52.0x5e.0xed.0xda

into the address bar of a browser. All these work in Firefox on Windows and
are recognized as valid IP addresses.

On the other hand,

0x52.com

is a valid domain name, in use by PairNIC.

But

http://test.0xda

is handled by Firefox on Windows as a domain name. It doesn't resolve, but it's
sent to DNS.

So I think the question is whether every term between dots can be parsed as
a decimal or hex number. If all terms can be parsed as a number, and there are
no more than four of them, it's an IP address. Otherwise it's a domain name.
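
As a rough illustration of the rule just stated, here is a minimal Python 3 sketch. The names is_numeric_term and looks_like_ip are hypothetical, and the check is purely syntactic: it does not verify that each number is in range, and it does not treat a leading 0 as octal.

```python
import string

HEX_DIGITS = set(string.hexdigits)
DEC_DIGITS = set(string.digits)


def is_numeric_term(term):
    """True if term is a decimal number or a 0x-prefixed hex number."""
    if term[:2].lower() == "0x":
        digits = term[2:]
        return bool(digits) and set(digits) <= HEX_DIGITS
    return bool(term) and set(term) <= DEC_DIGITS


def looks_like_ip(host):
    """Apply the rule: at most four dot-separated terms, all numeric."""
    terms = host.split(".")
    return len(terms) <= 4 and all(is_numeric_term(t) for t in terms)


for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(host, looks_like_ip(host))
```

The first two hosts are classified as IP addresses and the last two as domain names, matching the browser behaviour described above.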

There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites. So I really do need to get the hard cases right.

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

John Nagle
Jul 23 '07 #4
Here's another hard case. This one might be a bug in urlparse:

import urlparse

s = 'ftp://administrator:pa******@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'

urlparse.urlparse(s)

yields:

(u'ftp', u'administrator:pa******@64.105.135.30', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.

That's a real URL, from a search for phishing sites. There are lots
of hostile URLs out there, some of which can fool some parsers.

John Nagle

Jul 23 '07 #5
On 7/23/07, John Nagle wrote:
> Here's another hard case. This one might be a bug in urlparse:
>
> import urlparse
>
> s = 'ftp://administrator:pa******@64.105.135.30/originals/6 june 07/ebay/login/ebayisapi.html'
>
> urlparse.urlparse(s)
>
> yields:
>
> (u'ftp', u'administrator:pa******@64.105.135.30', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')
>
> That second field is supposed to be the "hostport" (per the RFC usage
> of the term; Python uses the term "netloc"), and the username/password
> should have been parsed and moved to the "username" and "password" fields
> of the object. So it looks like urlparse doesn't really understand FTP URLs.

Those values aren't "moved" to the fields; they're extracted on the
fly from the netloc. Use the .hostname property of the result tuple
to get just the hostname.
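
For illustration, a small sketch of those properties. It assumes Python 3, where the old urlparse module lives in urllib.parse, and it uses a made-up URL with placeholder credentials rather than the one from the thread.

```python
from urllib.parse import urlsplit  # Python 3 home of the old urlparse module

# Hypothetical URL with placeholder credentials (not the one from the thread).
url = "ftp://someuser:secret@192.0.2.30/originals/login.html"
parts = urlsplit(url)

print(parts.netloc)    # the raw field, userinfo still included
print(parts.username)  # extracted from netloc on access
print(parts.password)
print(parts.hostname)  # host only, without userinfo or port
print(parts.port)      # None here, since the URL gives no port
```

The netloc field itself is left untouched; username, password, hostname and port are computed from it when accessed.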

-Miles
Jul 23 '07 #6
On 7/22/07, John Nagle wrote:
> Is there any library function that correctly tests for an IP address vs. a
> domain name based on syntax, i.e. without looking it up in DNS?

import re

NETLOC_RE = re.compile(r'''^        # start of string
    (?:([^@]+)@)?                   # 1: optional userinfo
    (?:\[([0-9a-fA-F:]+)\]|         # 2: IPv6 addr
    ([^\[\]:]+))                    # 3: IPv4 addr or reg-name
    (?::(\d+))?                     # 4: optional port
    $''', re.VERBOSE)               # end of string

def normalize_IPv4(netloc):
    try:  # Assume it's an IP; if it's not, catch the error and return None
        host = NETLOC_RE.match(netloc).group(3)
        # int(o, 0) accepts decimal and 0x-prefixed hex
        # (Python 3 rejects leading-zero forms such as 010)
        octets = [int(o, 0) for o in host.split('.')]
        if len(octets) > 4:
            raise ValueError('too many parts')
        # Expand short forms (a.b, a.b.c, or one 32-bit number) to four octets
        for i in range(len(octets), 4):
            octets[i-1:] = divmod(octets[i-1], 256**(4-i))
        for o in octets:
            if not 0 <= o < 256:
                raise ValueError('octet out of range')
        host = '.'.join(str(o) for o in octets)
    except (ValueError, AttributeError):
        return None
    return host

def is_ip(netloc):
    if normalize_IPv4(netloc) is not None:
        return True
    match = NETLOC_RE.match(netloc)
    # IPv6 validation could be stricter
    return bool(match and match.group(2))

The first function, I'd imagine, is the more interesting of the two.
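
On the "IPv6 validation could be stricter" remark in the code: one possible way to tighten it, sketched here under the assumption that Python 3.3 or later is available, is to hand the bracketed text to the standard ipaddress module. The helper name is_ipv6_literal is hypothetical and not part of the code posted above.

```python
import ipaddress


def is_ipv6_literal(text):
    """True if text (without the square brackets) is a valid IPv6 address."""
    try:
        ipaddress.IPv6Address(text)
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False
    return True


for text in ("2001::123:4567:abcd", "::1", "1:2:3:4:5:6:7:8:9", "0x52.com"):
    print(text, is_ipv6_literal(text))
```

A check like this could replace the plain "group 2 matched" test, so that strings made only of hex digits and colons are not accepted unless they form a real address.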

-Miles
Jul 23 '07 #7
Apparently this will generally work as well:

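import re, socket

NETLOC_RE = ...  # same pattern as in the previous post

def normalize_IPv4(netloc):
    try:
        host = NETLOC_RE.match(netloc).group(3)
        # inet_aton parses the text form; inet_ntoa writes it back
        # out as a canonical dotted quad
        return socket.inet_ntoa(socket.inet_aton(host))
    except (AttributeError, TypeError, socket.error):
        # no match, host is None (IPv6 netloc), or not an IPv4 address
        return None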

Thanks to http://mail.python.org/pipermail/pyt...ly/450317.html
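
To see what the socket round trip does with the hosts discussed earlier in the thread, here is a short Python 3 sketch. The lowercase helper normalize_ipv4 is a hypothetical stand-in that skips the regex step, and the exact set of forms accepted depends on the platform's C library, so treat the behaviour as an assumption to verify locally.

```python
import socket


def normalize_ipv4(host):
    """Return dotted-decimal text for host, or None if inet_aton rejects it."""
    try:
        return socket.inet_ntoa(socket.inet_aton(host))
    except (OSError, ValueError, TypeError):
        # OSError is what socket.error refers to in Python 3
        return None


for candidate in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(candidate, "->", normalize_ipv4(candidate))
```

Because the parsing is delegated to the system library, this tends to agree with what browsers on the same platform accept, but it is worth testing against the hostile inputs you actually see.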

-Miles
Jul 23 '07 #8
