URL = 'http://steve:se****@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'
If I split the URL, I would like to get the following components:
scheme = 'http'
netloc = 'steve:se****@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'
I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:
('http', 'steve:se****@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')
If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.
There are some goodies in urllib for doing some of this
splitting. Example code is at the bottom of my reply (though it
seems to choke on certain protocols such as "mailto:" and "ssh:"
because urlparse doesn't return the netloc properly).
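For comparison, splitting the netloc by hand needs only string methods that already exist in Python 2.4. This is a minimal sketch, assuming the netloc has the shape user:password@host:port with everything except the host optional. The name split_netloc is a hypothetical helper, not something in the standard library, and the sketch makes no attempt to unquote percent-encoded credentials.

```python
def split_netloc(netloc):
    """Split 'user:password@host:port' into (username, password, host, port).

    Hypothetical helper for illustration. Parts that are absent come
    back as None; the port, when present, is returned as an int.
    """
    username = password = port = None
    host = netloc

    # Credentials, if any, sit before the last '@'.
    if '@' in host:
        creds, host = host.rsplit('@', 1)
        if ':' in creds:
            username, password = creds.split(':', 1)
        else:
            username = creds

    # Treat a trailing ':digits' as the port; anything else stays in host.
    if ':' in host:
        maybe_host, maybe_port = host.rsplit(':', 1)
        if maybe_port.isdigit():
            host, port = maybe_host, int(maybe_port)

    return username, password, host, port
```

With this sketch, split_netloc('user@example.com') gives ('user', None, 'example.com', None), and a bare 'localhost' gives (None, None, 'localhost', None).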
This second question isn't specifically Python related, but I'm asking it
anyway...
I'd also like to split the domain part of an HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?
I believe that dots are the sanctioned separator, HOWEVER, you
can have a non-qualified machine-name with local scope, so you
can easily have NO TLD, such as
http://user:password@localhost:8000/path/to/thing
There's also the ambiguity of what "TLD" means if you use IP
addresses:
http://user:pa******@192.168.1.1:8000/path/to/thing
Does that make the TLD "1"? Other odd edge-cases that are
usually allowable (but frowned upon, mostly used by
spammers/phishers) include using a long-int as the domain-name,
such as
http://user:password@2130706433:8000/path/to/thing
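The edge cases above can be folded into a dot-splitting routine roughly as follows. This is only a sketch: the names is_dotted_quad, domain_levels and long_int_to_dotted_quad are made up for illustration, and the choice to return an empty list for numeric hosts is an assumption about what the caller wants, not anything the HTTP specification dictates.

```python
import socket
import struct


def is_dotted_quad(host):
    """True for hosts like '192.168.1.1': four parts, each 0-255."""
    parts = host.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        if not part.isdigit() or int(part) > 255:
            return False
    return True


def domain_levels(host):
    """Return the host's labels, top level first.

    Numeric hosts (dotted quads and long-int addresses) have no
    meaningful TLD, so they yield an empty list. An unqualified name
    such as 'localhost' yields a single label.
    """
    if host.isdigit() or is_dotted_quad(host):
        return []
    labels = host.split('.')
    labels.reverse()
    return labels


def long_int_to_dotted_quad(host):
    """Convert a long-int host such as '2130706433' to dotted-quad form."""
    return socket.inet_ntoa(struct.pack('!I', int(host)))
```

Here domain_levels('www.domain.com.au') returns ['au', 'com', 'domain', 'www'], while long_int_to_dotted_quad('2130706433') returns '127.0.0.1', which shows why the long-int form is popular for disguising where a link really goes.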
In an attempt to play with these functions, I present the code below.
-tkc
import urlparse, urllib

tests = (
    'http://steve:se****@www.example.com.au:82/'
    'directory/file.html;params?query#fragment',
    'http://user:pa******@192.168.1.2/path/to/thing/',
    'http://192.168.1.2/path/to/thing/',
    'http://2130706433/path/to/thing/',
    'http://localhost/path/to/thing/',
    'http://user:password@localhost/path/to/thing/',
    'telnet://fo*@bar.com',
    'ssh://us**@example.com',
    'gopher://wais.example.edu',
    'svn+ssh://user:pa******@svn.example.com/svn/here/there/',
    'mailto:jo*@example.com',
    )

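def is_ip_address(s):
    # A dotted quad has exactly four parts, each an integer in 0..255.
    parts = s.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        try:
            if not 0 <= int(part) <= 255:
                return False
        except ValueError:
            return False
    return True
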
def steve_parse(url):
    (scheme, netloc, path,
        params, query, fragment) = urlparse.urlparse(url)
    # Peel the credentials and the port off the netloc.
    creds, host = urllib.splituser(netloc)
    username, password = urllib.splitpasswd(creds or '')
    host, port = urllib.splitport(host)
    # Only a dotted, non-numeric host has a TLD to split off.
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain = host
        tld = ''
    return (
        scheme, username, password,
        domain, tld, port,
        path, params, query,
        fragment)

if __name__ == '__main__':
    for test in tests:
        print test
        (scheme, username, password,
            domain, tld, port,
            path, params, query,
            fragment) = steve_parse(test)
        print '\tScheme: ', scheme
        print '\tUsername: ', username
        print '\tPassword: ', password
        print '\tDomain: ', domain
        print '\tTLD: ', tld
        print '\tPort: ', port
        print '\tPath: ', path
        print '\tParams: ', params
        print '\tQuery: ', query
        print '\tFragment: ', fragment
        print '='*50