Splitting URLs

I'm trying to split a URL into components. For example:

URL = 'http://steve:secret@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'
(joining the strings above with plus has no significance, it's just to
avoid word-wrapping)

If I split the URL, I would like to get the following components:

scheme = 'http'
netloc = 'steve:secret@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'

I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:

('http', 'steve:secret@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')

If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.
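
For reference, splitting the netloc by hand needs nothing beyond plain string methods. The following is only a minimal sketch: split_netloc is a hypothetical helper name, not a standard library function. It splits the user info at the last '@', treats a trailing ':digits' as the port, and does no percent-decoding or validation. It was written against a current Python 3 and deliberately sticks to find/rfind and slicing; whether it runs unchanged on 2.4 is an untested assumption.

```python
def split_netloc(netloc):
    """Split 'user:password@host:port' into (username, password, hostname, port).

    Parts that are absent come back as None. Hypothetical helper for illustration.
    """
    username = password = port = None
    hostport = netloc

    # Everything before the last '@' is user info.
    at = netloc.rfind('@')
    if at != -1:
        userinfo, hostport = netloc[:at], netloc[at + 1:]
        colon = userinfo.find(':')
        if colon != -1:
            username, password = userinfo[:colon], userinfo[colon + 1:]
        else:
            username = userinfo

    # A trailing ':<digits>' is taken to be the port.
    hostname = hostport
    colon = hostport.rfind(':')
    if colon != -1 and hostport[colon + 1:].isdigit():
        hostname, port = hostport[:colon], int(hostport[colon + 1:])

    return username, password, hostname, port


print(split_netloc('steve:secret@www.domain.com.au:82'))
# ('steve', 'secret', 'www.domain.com.au', 82)
```

Returning None for missing parts keeps "no password" distinguishable from "empty password", which may or may not matter for a given application.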

This second question isn't specifically Python related, but I'm asking it
anyway...

I'd also like to split the domain part of an HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?

(If it does, I might decide to live with the lack... it depends on how
odd the corner is, and how much work it takes to fix.)

--
Steven.
Oct 21 '07 #1
> If I'm using Python 2.5, I can split the netloc field further with named
> attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
> to support 2.4). Before I write code to split the netloc field by hand (a
> nuisance, but doable) I thought I'd ask if there was a function somewhere
> in the standard library I had missed.

There are some goodies in urllib for doing some of this
splitting. Example code is at the bottom of my reply (though it
seems to choke on certain protocols such as "mailto:" and "ssh:"
because urlparse doesn't return the netloc properly).

> This second question isn't specifically Python related, but I'm asking it
> anyway...
>
> I'd also like to split the domain part of a HTTP netloc into top level
> domain (.au), second level (.com), etc. I don't need to validate the TLD,
> I just need to split it. Is splitting on dots sufficient, or will that
> miss some odd corner case of the HTTP specification?

I believe that dots are the sanctioned separator. HOWEVER, you
can have an unqualified machine name with local scope, so you
can easily have NO TLD, such as

http://user:password@localhost:8000/path/to/thing

There's also the ambiguity of what "TLD" means if you use IP
addresses:

http://user:pa******@192.168.1.1:8000/path/to/thing

Does that make the TLD "1"? Other odd edge-cases that are
usually allowable (but frowned upon, mostly used by
spammers/phishers) include using a long-int as the domain-name,
such as

http://user:password@2130706433:8000/path/to/thing
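
The edge cases above can be sketched as follows. This is an illustration under stated assumptions, not a complete treatment: domain_labels is a hypothetical helper name, it relies on the ipaddress module from a current Python 3 standard library (so it does not apply to 2.4), and it simply declines to split anything that looks like an IP address, including the bare-integer form.

```python
import ipaddress


def domain_labels(host):
    """Return the dot-separated labels of host, top level first.

    Returns [] when host is an IP address (dotted or a bare integer),
    because "TLD" has no meaning there. Hypothetical helper for illustration.
    """
    if host.isdigit():
        # A bare integer such as 2130706433 is a disguised IPv4 address.
        return []
    try:
        ipaddress.ip_address(host)
        return []
    except ValueError:
        pass  # not an IP literal, so treat it as a host name
    # A fully qualified name may carry a trailing dot; drop it before splitting.
    return host.rstrip('.').split('.')[::-1]


print(domain_labels('www.domain.com.au'))   # ['au', 'com', 'domain', 'www']
print(domain_labels('localhost'))           # ['localhost']
print(domain_labels('192.168.1.1'))         # []
print(ipaddress.ip_address(2130706433))     # 127.0.0.1
```

An unqualified name such as localhost comes back as a single label, so the caller still has to decide whether a one-element result counts as "no TLD".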

In an attempt to play with these functions, I present the code below.

-tkc
import urlparse, urllib

tests = (
    'http://steve:se****@www.example.com.au:82/'
    'directory/file.html;params?query#fragment',
    'http://user:pa******@192.168.1.2/path/to/thing/',
    'http://192.168.1.2/path/to/thing/',
    'http://2130706433/path/to/thing/',
    'http://localhost/path/to/thing/',
    'http://user:password@localhost/path/to/thing/',
    'telnet://fo*@bar.com',
    'ssh://us**@example.com',
    'gopher://wais.example.edu',
    'svn+ssh://user:pa******@svn.example.com/svn/here/there/',
    'mailto:jo*@example.com',
    )

def is_ip_address(s):
    # True only for four dot-separated integers, each in the range 0..255.
    parts = s.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        try:
            # Test the octet itself (not its index in the list).
            if not 0 <= int(part) <= 255:
                return False
        except ValueError:
            return False
    return True

def steve_parse(url):
    (scheme, netloc, path,
     params, query, fragment) = urlparse.urlparse(url)
    # Peel the credentials and the port off the netloc.
    creds, host = urllib.splituser(netloc)
    username, password = urllib.splitpasswd(creds or '')
    host, port = urllib.splitport(host)
    # Only host names (not IP addresses) have a TLD to split off.
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain = host
        tld = ''
    return (
        scheme, username, password,
        domain, tld, port,
        path, params, query,
        fragment)

if __name__ == '__main__':
    for test in tests:
        print test
        (scheme, username, password,
         domain, tld, port,
         path, params, query,
         fragment) = steve_parse(test)
        print '\tScheme: ', scheme
        print '\tUsername: ', username
        print '\tPassword: ', password
        print '\tDomain: ', domain
        print '\tTLD: ', tld
        print '\tPort: ', port
        print '\tPath: ', path
        print '\tParams: ', params
        print '\tQuery: ', query
        print '\tFragment: ', fragment
        print '='*50
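
For comparison, here is a minimal sketch of the same split on a current Python 3, where urlparse lives in urllib.parse and its result carries username, password, hostname and port as attributes (the "named attributes" mentioned for 2.5 earlier in the thread). It assumes Python 3, which is outside the 2.4 constraint of the original question, and it assumes the password in the example URL is 'secret', as listed in that question.

```python
from urllib.parse import urlparse

url = ('http://steve:secret@www.domain.com.au:82/'
       'directory/file.html;params?query#fragment')
parts = urlparse(url)

# The six classic fields are still available by position or by name.
print(parts.scheme, parts.netloc, parts.path)
print(parts.params, parts.query, parts.fragment)

# The netloc is broken down further through attributes.
print(parts.username)   # steve
print(parts.password)   # secret
print(parts.hostname)   # www.domain.com.au
print(parts.port)       # 82
```

Note that port comes back as an integer (or None when absent), while the other attributes are strings.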

Oct 21 '07 #2
On Sun, 21 Oct 2007 14:55:01 -0500, Tim Chase wrote:
> there are some goodies in urllib for doing some of this splitting.
> Example code at the bottom of my reply (though it seems to choke on
> certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
> return the netloc properly)

It doesn't? That's... bad. But for my application, probably not
important: I only care about HTTP.

Thanks for the reply and sample code.
--
Steven
Oct 21 '07 #3
>> there are some goodies in urllib for doing some of this splitting.
>> Example code at the bottom of my reply (though it seems to choke on
>> certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
>> return the netloc properly)
>
> It doesn't? That's... bad. But for my application, probably not
> important: I only care about HTTP.

This seems to be intentional, rather than a bug. In my
python2.4/urlparse.py file, there's a uses_netloc list which
clearly does not have 'mailto' in it. I can't give an
explanation/justification for it, but it seems to me (IMHO) that
there is a netloc involved in a mail address.

Or maybe I have a semantic misunderstanding of what the netloc
field means when returned from urlparse.urlparse. However, since
this is where the hostname appears in "http", it makes me think
that the hostname from a mailto URL should also appear in this
result field.
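
As a data point from a later interpreter, the sketch below shows what a current Python 3 does with these two schemes. It assumes Python 3's urllib.parse rather than the 2.4 urlparse module discussed here, and the addresses are made up for illustration. There, netloc is filled in only when '//' follows the scheme, so an "ssh://" URL gets a netloc while a "mailto:" URL keeps the whole address in the path.

```python
from urllib.parse import urlparse

# Addresses chosen for illustration only.
mail = urlparse('mailto:user@example.com')
print(repr(mail.netloc))   # '' because there is no '//' after the scheme
print(mail.path)           # user@example.com

ssh = urlparse('ssh://user@example.com')
print(ssh.netloc)          # user@example.com
print(ssh.username)        # user
print(ssh.hostname)        # example.com
```

So for a mailto URL the host part still has to be split out of the path by hand, for example at the last '@'.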

-tkc

Oct 22 '07 #4
On 22 Oct, 03:53, Tim Chase <python.l...@tim.thechases.com> wrote:

> This seems to be intentional, rather than a bug. In my
> python2.4/urlparse.py file, there's a uses_netloc list which
> clearly does not have 'mailto' in it. I can't give an
> explanation/justification for it, but it seems to me (IMHO) that
> there is a netloc involved in a mail address.

As is often the case with the standard library, there are various open
issues around the functionality:

http://bugs.python.org/issue?%40filt..._text=RFC+3986

This proposed module (in the above search results) attempts to
implement RFC 3986:

http://bugs.python.org/issue1500504

I'm not sure whether itools.uri goes as far as you might like:

http://download.ikaaro.org/doc/itools/chapter--uri.html

Either way, after listening to Ron Stephens' most recent Python411
podcast, where he mentions that it's apparently up to the community to
fix the standard library (according to GvR and the core developers),
perhaps there's some demand for a "Python 300" which just cleans up
the standard library in a potentially (but not necessarily) backwards-
incompatible fashion.

Paul

Oct 22 '07 #5
