Bytes IT Community

Sort by domain name?

P: n/a
js
Hi list,

I have a list of URLs and I want to sort that list by domain name.

Here, the domain name doesn't include the subdomain; that is,
parts such as 'www', 'mail', 'news' and 'en' should be excluded.

For example, if the list was the following
------------------------------------------------------------
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com
------------------------------------------------------------

the sort's output would be
------------------------------------------------------------
http://google.com
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://mail.yahoo.com
------------------------------------------------------------

As you can see above, I don't want to
Thanks in advance.
Oct 2 '06 #1
12 Replies


P: n/a
"js" <eb*****@gmail.com> writes:
> Here, domain name doesn't contain subdomain,
> or should I say, domain's part of 'www', mail, news and en should be
> excluded.

It's a little more complicated, you have to treat co.uk about
the same way as .com, and similarly for some other countries
but not all. For example, subdomain.companyname.de versus
subdomain.companyname.com.au or subdomain.companyname.co.uk.
You end up needing a table or special code to say
how to treat various countries.
Oct 2 '06 #2

P: n/a
>> Here, domain name doesn't contain subdomain, or should I
>> say, domain's part of 'www', mail, news and en should be
>> excluded.
>
> It's a little more complicated, you have to treat co.uk about
> the same way as .com, and similarly for some other countries
> but not all. For example, subdomain.companyname.de versus
> subdomain.companyname.com.au or subdomain.companyname.co.uk.
> You end up needing a table or special code to say how to treat
> various countries.

In addition, you get very different results even on just "base"
domain-name, such as "whitehouse" based on whether you use the
".gov" or ".com" variant of the TLD. Thus, I'm not sure there's
any way to discern this example from the "yahoo.com" vs.
"yahoo.co.uk" variant without doing a boatload of WHOIS queries,
which in turn might be misleading anyways.

A first-pass solution might look something like:

##############################################################
>>> sites
['http://mail.google.com', 'http://reader.google.com',
 'http://mail.yahoo.co.uk', 'http://google.com',
 'http://mail.yahoo.com']
>>> # slice off the scheme: lstrip('http://') strips a set of characters, not a prefix
>>> sitebits = [site.lower()[len('http://'):].split('.')
...             for site in sites]
>>> for site in sitebits: site.reverse()
...
>>> sorted(sitebits)
[['com', 'google'], ['com', 'google', 'mail'], ['com', 'google', 'reader'],
 ['com', 'yahoo', 'mail'], ['uk', 'co', 'yahoo', 'mail']]
>>> results = ['http://' + '.'.join(reversed(site))
...            for site in sorted(sitebits)]
>>> results
['http://google.com', 'http://mail.google.com',
 'http://reader.google.com', 'http://mail.yahoo.com',
 'http://mail.yahoo.co.uk']
##############################################################

which can be wrapped up like this:

##############################################################
>>> def sort_by_domain(sites):
...     # slice off the scheme: lstrip('http://') strips a set of characters, not a prefix
...     sitebits = [site.lower()[len('http://'):].split('.')
...                 for site in sites]
...     for site in sitebits: site.reverse()
...     return ['http://' + '.'.join(reversed(site))
...             for site in sorted(sitebits)]
...
>>> sort_by_domain(sites)
['http://google.com', 'http://mail.google.com',
 'http://reader.google.com', 'http://mail.yahoo.com',
 'http://mail.yahoo.co.uk']
##############################################################

to give you a sorting function. It assumes http rather than
having mixed url-types, such as ftp or mailto. They're easy
enough to strip off as well, but putting them back on becomes a
little more exercise.

Just a few ideas,

-tkc


Oct 2 '06 #3

P: n/a

Paul Rubin wrote:
> "js" <eb*****@gmail.com> writes:
>> Here, domain name doesn't contain subdomain,
>> or should I say, domain's part of 'www', mail, news and en should be
>> excluded.
>
> It's a little more complicated, you have to treat co.uk about
> the same way as .com, and similarly for some other countries
> but not all. For example, subdomain.companyname.de versus
> subdomain.companyname.com.au or subdomain.companyname.co.uk.
> You end up needing a table or special code to say
> how to treat various countries.

Plus, how do you order "https:", "ftp", URLs with "www.", "www2." ,
named anchors etc?

Gentle reminder: is this homework? And you can expect better responses
if you show you've bootstrapped yourself on the problem to some extent.

Oct 2 '06 #4

P: n/a
js
Thanks for your quick reply.
Yeah, it's a hard task and unfortunately even Google doesn't help me much.

All I want to do is to sort a list of URLs by company name,
like oreilly, ask, skype, amazon, google and so on, to find out
how many companies' URLs the list contains.
Oct 2 '06 #5

P: n/a
Tim Chase:
> to give you a sorting function. It assumes http rather than
> having mixed url-types, such as ftp or mailto. They're easy
> enough to strip off as well, but putting them back on becomes a
> little more exercise.

With a modern Python you don't need to do all that work, you can do:

sorted(urls, key=cleaner)

where cleaner is a function that extracts, from each string you have
to sort, the part that matters for the ordering.
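
One possible `cleaner`, as a minimal sketch only: the helper below is hypothetical (it is not from the original post), is written for Python 3, and assumes plain `scheme://host` URLs like the ones in the question. It uses the reversed host labels as the sort key, so the original strings never have to be taken apart and reassembled.

```python
def cleaner(url):
    """Hypothetical key function: reversed host labels, e.g. ['com', 'google', 'mail']."""
    # Drop the scheme if there is one, then anything after the host.
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host.lower().split(".")[::-1]

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

result = sorted(urls, key=cleaner)
for url in result:
    print(url)
```

Because the key is computed from the URL and the URL itself is what gets sorted, the scheme is preserved without any extra work.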

Bye,
bearophile

Oct 2 '06 #6

P: n/a
js:
> All I want to do is to sort out a list of url by companyname,
> like oreilly, ask, skype, amazon, google and so on, to find out
> how many company's url the list contain.

Then if you can define a good enough list of such company names, you
can just do a search of such names inside each url.
Maybe you can use a string method, or an RE, or create a big string with
all the company names and perform a longest common subsequence search
using the stdlib function.
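
A rough sketch of the simplest variant of this idea, assuming you already have a set of company names. The names, the sample URLs and the `company_of` helper are all made up for illustration. It matches whole host labels instead of raw substrings, so that "ask" does not match "basket".

```python
from collections import Counter

# Hypothetical sample of known company names.
COMPANIES = {"oreilly", "ask", "skype", "amazon", "google"}

def company_of(url):
    """Return the first host label that is a known company name, else None."""
    host = url.split("://", 1)[-1].split("/", 1)[0].lower()
    for label in host.split("."):
        if label in COMPANIES:
            return label
    return None

urls = [
    "http://mail.google.com",
    "http://www.amazon.co.uk",
    "http://google.com",
    "http://basket.example.com",
]

# Count how many URLs belong to each known company (None = no match).
counts = Counter(company_of(u) for u in urls)
print(counts)
```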

Bye,
bearophile

Oct 2 '06 #7

P: n/a

gene tani wrote:
> Plus, how do you order "https:", "ftp", URLs with "www.", "www2." ,
> named anchors etc?

Now is a good time to point out the urlparse module in the standard
library. It will help the OP with all of this stuff.
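
A small illustration of that suggestion, as a sketch only. In Python 3 the `urlparse` module was renamed to `urllib.parse`, which is what is used here. The sample URLs and the `host_of` helper are made up for the example.

```python
from urllib.parse import urlsplit

# Made-up sample URLs covering the cases raised above.
samples = [
    "https://www2.example.com/page#top",
    "ftp://ftp.example.org/pub/",
    "mailto:someone@example.com",
]

for url in samples:
    parts = urlsplit(url)
    # hostname is None when the URL has no network location (e.g. mailto:)
    print(parts.scheme, parts.hostname, parts.fragment)

def host_of(url):
    """Hypothetical helper: the host part of a URL, or '' if there is none."""
    return urlsplit(url).hostname or ""
```

With the host isolated this way, the scheme, path and named anchor no longer interfere with whatever domain-based ordering you choose.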

just adding my 2 cents.

....
jay graves

Oct 2 '06 #8

P: n/a
"js" <eb*****@gmail.com> writes:
> All I want to do is to sort out a list of url by companyname,
> like oreilly, ask, skype, amazon, google and so on, to find out
> how many company's url the list contain.

Here's a function I used to use. It makes no attempt to be
exhaustive, but did a reasonable job on the domains I cared about at
the time:

import re

def host_domain(hostname):
    parts = hostname.split('.')
    if parts[-1] in ('au', 'uk', 'nz', 'za', 'jp', 'br'):
        # www.foobar.co.uk, etc
        host_len = 3
    elif len(parts) == 4 and re.match(r'^[\d.]+$', hostname):
        host_len = 4  # 2.3.4.5 numeric address
    else:
        host_len = 2
    d = '.'.join(parts[-host_len:])
    # print 'host_domain:', hostname, '=>', d
    return d
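
For what it's worth, here is a sketch of how a function like this could be used as a sort key on the example list from the question. It is written for Python 3, restates the same rules so the snippet runs on its own, and the `sort_key` helper is hypothetical. It sorts first by the computed domain and then by the full host name.

```python
import re

def host_domain(hostname):
    # Same rules as the function above, restated so this snippet is self-contained.
    parts = hostname.split(".")
    if parts[-1] in ("au", "uk", "nz", "za", "jp", "br"):
        host_len = 3
    elif len(parts) == 4 and re.match(r"^[\d.]+$", hostname):
        host_len = 4
    else:
        host_len = 2
    return ".".join(parts[-host_len:])

def sort_key(url):
    # Hypothetical helper; assumes plain "scheme://host" URLs as in the question.
    host = url.split("://", 1)[-1].split("/", 1)[0].lower()
    return (host_domain(host), host)

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

ordered = sorted(urls, key=sort_key)
for url in ordered:
    print(url)
```

On this particular input the result matches the ordering asked for in the original question, but only because of how "yahoo.co.uk" and "yahoo.com" happen to compare as strings; it does not group the two Yahoo domains under one company.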
Oct 2 '06 #9

P: n/a
"js" <eb*****@gmail.com> wrote in message
news:ma***************************************@python.org...
> Hi list,
>
> I have a list of URL and I want to sort that list by the domain name.
>
> Here, domain name doesn't contain subdomain,
> or should I say, domain's part of 'www', mail, news and en should be
> excluded.
>
> For example, if the list was the following
> ------------------------------------------------------------
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://google.com
> http://mail.yahoo.com
> ------------------------------------------------------------
>
> the sort's output would be
> ------------------------------------------------------------
> http://google.com
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://mail.yahoo.com
> ------------------------------------------------------------
>
> As you can see above, I don't want to
> Thanks in advance.

How about sorting the strings as they are reversed?

urls = """\
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com""".split("\n")

sortedList = [ su[1] for su in sorted([ (u[::-1],u) for u in urls ]) ]

for url in sortedList:
    print(url)

Prints:
http://mail.yahoo.co.uk
http://mail.google.com
http://reader.google.com
http://google.com
http://mail.yahoo.com
Close to what you are looking for, might be good enough?

-- Paul
Oct 2 '06 #10

P: n/a
js
> Gentle reminder: is this homework? And you can expect better responses
> if you show you've bootstrapped yourself on the problem to some extent.

Sure thing.
First I tried to solve this by using a list of domains found at
http://www.neuhaus.com/domaincheck/domain_list.htm

I converted this to a list (in Python) and tried something like the following:

look for url that endswith(domain in domains)
if found:
    capture the left side of the domain part (tld) and
    save all url to a dictionary whose key is the captured string

To me this seems to work, but I'm stuck because the solution doesn't seem good enough.
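
The approach described above could be sketched roughly as follows. The suffix table here is a tiny made-up sample (a real list would be far longer), and the function names are hypothetical. The longest matching suffix is tried first so that "co.uk" wins over a shorter suffix, and the label immediately to its left is used as the dictionary key.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny suffix table.
SUFFIXES = ["co.uk", "com.au", "com", "de", "org"]

def company_key(host):
    """Return the label just left of the longest matching suffix, or None."""
    host = host.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            left = host[: -(len(suffix) + 1)]
            return left.split(".")[-1]
    return None

def group_by_company(urls):
    """Map each captured company label to the list of URLs that share it."""
    groups = defaultdict(list)
    for url in urls:
        host = url.split("://", 1)[-1].split("/", 1)[0]
        groups[company_key(host)].append(url)
    return dict(groups)

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

groups = group_by_company(urls)
for company in sorted(groups):
    print(company, len(groups[company]))
```

The number of keys in `groups` is then the number of distinct companies, which is the count the list was meant to produce. Hosts whose suffix is missing from the table all end up under the `None` key, so the quality of the result depends entirely on how complete the table is.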
Oct 2 '06 #11

P: n/a
js
On 2 Oct 2006 08:56:09 -0700, be************@lycos.com
<be************@lycos.com> wrote:
> js:
>> All I want to do is to sort out a list of url by companyname,
>> like oreilly, ask, skype, amazon, google and so on, to find out
>> how many company's url the list contain.
>
> Then if you can define a good enough list of such company names, you
> can just do a search of such names inside each url.
> Maybe you can use string method, or a RE, or create a big string with
> all the company names and perform a longest common subsequence search
> using the stdlib function.

Well, I think the list is so large that it's impossible to
create such a good company list.
Oct 2 '06 #12

P: n/a
js
> How about sorting the strings as they are reversed?
>
> urls = """\
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://google.com
> http://mail.yahoo.com""".split("\n")
>
> sortedList = [ su[1] for su in sorted([ (u[::-1],u) for u in urls ]) ]
>
> for url in sortedList:
>     print url
> <snip>
>
> Close to what you are looking for, might be good enough?

Great... I couldn't have thought of it that way. Thanks a lot!
Oct 2 '06 #13
