Splitting URLs

Steven D'Aprano

I'm trying to split a URL into components. For example:

URL = 'http://steve:se****@www.domain.com.au:82/dir' + \
'ectory/file.html;params?query#fragment'
(Joining the strings above with plus has no significance; it's just to
avoid word-wrapping.)

If I split the URL, I would like to get the following components:

scheme = 'http'
netloc = 'steve:se****@www.domain.com.au:82'
username = 'steve'
password = 'secret'
hostname = 'www.domain.com.au'
port = 82
path = '/directory/file.html'
parameters = 'params'
query = 'query'
fragment = 'fragment'

I can get *most* of the way with urlparse.urlparse: it will split the URL
into a tuple:

('http', 'steve:se****@www.domain.com.au:82', '/directory/file.html',
'params', 'query', 'fragment')

If I'm using Python 2.5, I can split the netloc field further with named
attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
to support 2.4). Before I write code to split the netloc field by hand (a
nuisance, but doable) I thought I'd ask if there was a function somewhere
in the standard library I had missed.

This second question isn't specifically Python related, but I'm asking it
anyway...

I'd also like to split the domain part of an HTTP netloc into top level
domain (.au), second level (.com), etc. I don't need to validate the TLD,
I just need to split it. Is splitting on dots sufficient, or will that
miss some odd corner case of the HTTP specification?

(If it does, I might decide to live with the lack... it depends on how
odd the corner is, and how much work it takes to fix.)

--
Steven.

Oct 21 '07 #1


Tim Chase

> URL = 'http://steve:se****@www.domain.com.au:82/dir' + \
> 'ectory/file.html;params?query#fragment'
>
> If I split the URL, I would like to get the following components:
>
> scheme = 'http'
> netloc = 'steve:se****@www.domain.com.au:82'
> username = 'steve'
> password = 'secret'
> hostname = 'www.domain.com.au'
> port = 82
> path = '/directory/file.html'
> parameters = 'params'
> query = 'query'
> fragment = 'fragment'
>
> I can get *most* of the way with urlparse.urlparse: it will split the URL
> into a tuple:
>
> ('http', 'steve:se****@www.domain.com.au:82', '/directory/file.html',
> 'params', 'query', 'fragment')
>
> If I'm using Python 2.5, I can split the netloc field further with named
> attributes. Unfortunately, I can't rely on Python 2.5 (for my sins I have
> to support 2.4). Before I write code to split the netloc field by hand (a
> nuisance, but doable) I thought I'd ask if there was a function somewhere
> in the standard library I had missed.

There are some goodies in urllib for doing some of this
splitting. Example code is at the bottom of my reply (though it
seems to choke on certain protocols such as "mailto:" and "ssh:"
because urlparse doesn't return the netloc properly).
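
For comparison, here is a minimal sketch of splitting the netloc by hand, assuming the usual user:password@host:port layout. The name split_netloc is hypothetical (it is not a standard library function), and the sketch sticks to plain string methods so that it does not depend on any particular urlparse version. Bracketed IPv6 literals are outside its scope.

```python
def split_netloc(netloc):
    """Split 'user:password@host:port' into (username, password, host, port).

    Hypothetical helper, not part of the standard library.
    Missing parts come back as None; the port is returned as an int.
    """
    username = password = port = None
    host = netloc
    # Credentials end at the LAST '@', in case the password contains one.
    at = netloc.rfind('@')
    if at != -1:
        creds, host = netloc[:at], netloc[at + 1:]
        # The username ends at the FIRST ':'; the rest is the password.
        colon = creds.find(':')
        if colon != -1:
            username, password = creds[:colon], creds[colon + 1:]
        else:
            username = creds
    # Treat a trailing ':digits' as the port.
    colon = host.rfind(':')
    if colon != -1 and host[colon + 1:].isdigit():
        host, port = host[:colon], int(host[colon + 1:])
    return username, password, host, port


print(split_netloc('user:password@localhost:8000'))
```

Returning None for absent parts keeps "no password" distinguishable from "empty password", which matters if the pieces are later joined back into a URL.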

> This second question isn't specifically Python related, but I'm asking it
> anyway...
>
> I'd also like to split the domain part of an HTTP netloc into top level
> domain (.au), second level (.com), etc. I don't need to validate the TLD,
> I just need to split it. Is splitting on dots sufficient, or will that
> miss some odd corner case of the HTTP specification?

I believe that dots are the sanctioned separator, HOWEVER, you
can have a non-qualified machine-name with local scope, so you
can easily have NO TLD, such as

http://user:password@localhost:8000/path/to/thing

There's also the ambiguity of what "TLD" means if you use IP
addresses:

http://user:pa******@192.168.1.1:8000/path/to/thing

Does that make the TLD "1"? Other odd edge-cases that are
usually allowable (but frowned upon, mostly used by
spammers/phishers) include using a long-int as the domain-name,
such as

http://user:password@2130706433:8000/path/to/thing
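
The edge cases above can be sketched roughly as follows. This is only an illustration, assuming three kinds of host: a dotted-quad IPv4 address, a bare integer, and everything else treated as a dotted name. The function names (is_dotted_quad, split_host, int_host_to_dotted_quad) are made up for this example.

```python
import socket
import struct


def is_dotted_quad(host):
    """True for four dot-separated decimal parts, each in 0..255."""
    parts = host.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        if not part.isdigit() or int(part) > 255:
            return False
    return True


def split_host(host):
    """Return (kind, labels); kind is 'ipv4', 'integer' or 'name'.

    For names, labels are listed from the top level down, so
    labels[0] is the TLD only when there is more than one label.
    """
    if is_dotted_quad(host):
        return 'ipv4', [host]
    if host.isdigit():
        return 'integer', [host]
    labels = host.split('.')
    labels.reverse()
    return 'name', labels


def int_host_to_dotted_quad(host):
    """Render an all-digits host as the IPv4 address it encodes."""
    return socket.inet_ntoa(struct.pack('!I', int(host)))


print(split_host('www.domain.com.au'))
print(split_host('localhost'))
print(split_host('192.168.1.1'))
print(int_host_to_dotted_quad('2130706433'))
```

Checking for addresses before splitting on dots avoids reporting "1" as the TLD of 192.168.1.1, and a single-label result such as localhost signals that there is no TLD at all.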

In an attempt to play with these functions, I present the code below.

-tkc

import urlparse, urllib

tests = (
    'http://steve:se****@www.example.com.au:82/'
    'directory/file.html;params?query#fragment',
    'http://user:pa******@192.168.1.2/path/to/thing/',
    'http://192.168.1.2/path/to/thing/',
    'http://2130706433/path/to/thing/',
    'http://localhost/path/to/thing/',
    'http://user:password@localhost/path/to/thing/',
    'telnet://fo*@bar.com',
    'ssh://us**@example.com',
    'gopher://wais.example.edu',
    'svn+ssh://user:pa******@svn.example.com/svn/here/there/',
    'mailto:jo*@example.com',
    )

def is_ip_address(s):
    # True only for dotted-quad strings: four parts, each in 0..255
    parts = s.split('.')
    if len(parts) != 4:
        return False
    for part in parts:
        try:
            # test the part itself, not its index
            if not 0 <= int(part) <= 255:
                return False
        except ValueError:
            return False
    return True

def steve_parse(url):
    (scheme, netloc, path,
     params, query, fragment) = urlparse.urlparse(url)
    # peel the credentials and the port off the netloc
    creds, host = urllib.splituser(netloc)
    username, password = urllib.splitpasswd(creds or '')
    host, port = urllib.splitport(host)
    # only a dotted name that is not an IP address has a TLD
    if '.' in host and not is_ip_address(host):
        domain, tld = host.rsplit('.', 1)
    else:
        domain = host
        tld = ''
    return (
        scheme, username, password,
        domain, tld, port,
        path, params, query,
        fragment)

if __name__ == '__main__':
    for test in tests:
        print test
        (scheme, username, password,
         domain, tld, port,
         path, params, query,
         fragment) = steve_parse(test)
        print '\tScheme: ', scheme
        print '\tUsername: ', username
        print '\tPassword: ', password
        print '\tDomain: ', domain
        print '\tTLD: ', tld
        print '\tPort: ', port
        print '\tPath: ', path
        print '\tParams: ', params
        print '\tQuery: ', query
        print '\tFragment: ', fragment
        print '='*50

Oct 21 '07 #2

Steven D'Aprano

On Sun, 21 Oct 2007 14:55:01 -0500, Tim Chase wrote:

> there are some goodies in urllib for doing some of this splitting.
> Example code at the bottom of my reply (though it seems to choke on
> certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
> return the netloc properly)

It doesn't? That's... bad. But for my application, probably not
important: I only care about HTTP.

Thanks for the reply and sample code.
--
Steven

Oct 21 '07 #3

Tim Chase

>> there are some goodies in urllib for doing some of this splitting.
>> Example code at the bottom of my reply (though it seems to choke on
>> certain protocols such as "mailto:" and "ssh:" because urlparse doesn't
>> return the netloc properly)
>
> It doesn't? That's... bad. But for my application, probably not
> important: I only care about HTTP.

This seems to be intentional, rather than a bug. In my
python2.4/urlparse.py file, there's a uses_netloc list which
clearly does not have 'mailto' in it. I can't give an
explanation/justification for it, but it seems to me (IMHO) that
there is a netloc involved in a mail address.

Or maybe I have a semantic misunderstanding of what the netloc
field means when returned from urlparse.urlparse. However, since
this is where the hostname appears in "http", it makes me think
that the hostname from a mailto URL should also appear in this
result field.
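
A quick way to check what urlparse reports for a given scheme on whatever Python is at hand is sketched below. The addresses are made up for illustration, and the import fallback is only there so the same lines run on both the old urlparse module and its later home in urllib.parse. On the versions I'm aware of, the mailto address ends up in the path field and the netloc is empty.

```python
try:
    from urlparse import urlparse       # Python 2 module name
except ImportError:
    from urllib.parse import urlparse   # same function in Python 3

# Index 1 of the result is the netloc, index 2 is the path.
http_parts = urlparse('http://user@example.com/path')
mailto_parts = urlparse('mailto:someone@example.com')

print(repr(http_parts[1]))
print(repr(mailto_parts[1]))
print(repr(mailto_parts[2]))
```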

-tkc

Oct 22 '07 #4

Paul Boddie

On 22 Oct, 03:53, Tim Chase <python.l...@tim.thechases.com> wrote:

> This seems to be intentional, rather than a bug. In my
> python2.4/urlparse.py file, there's a uses_netloc list which
> clearly does not have 'mailto' in it. I can't give an
> explanation/justification for it, but it seems to me (IMHO) that
> there is a netloc involved in a mail address.

As is often the case with the standard library, there are various open
issues around the functionality:

http://bugs.python.org/issue?%40filt..._text=RFC+3986

This proposed module (in the above search results) attempts to
implement RFC 3986:

http://bugs.python.org/issue1500504

I'm not sure whether itools.uri goes as far as you might like:

http://download.ikaaro.org/doc/itools/chapter--uri.html

Either way, after listening to Ron Stephens' most recent Python411
podcast, where he mentions that it's apparently up to the community to
fix the standard library (according to GvR and the core developers),
perhaps there's some demand for a "Python 300" which just cleans up
the standard library in a potentially (but not necessarily) backwards-
incompatible fashion.

Paul

Oct 22 '07 #5
