Sort by domain name?

Hi list,

I have a list of URLs and I want to sort that list by domain name.

Here, the domain name doesn't include the subdomain;
in other words, parts such as 'www', 'mail', 'news' and 'en' should be excluded.

For example, if the list was the following
------------------------------------------------------------
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com
------------------------------------------------------------

the sort's output would be
------------------------------------------------------------
http://google.com
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://mail.yahoo.com
------------------------------------------------------------

As you can see above, I don't want to
Thanks in advance.

Oct 2 '06 #1


Paul Rubin

"js " <eb*****@gmail.com> writes:

> Here, domain name doesn't contain subdomain,
> or should I say, domain's part of 'www', mail, news and en should be
> excluded.

It's a little more complicated, you have to treat co.uk about
the same way as .com, and similarly for some other countries
but not all. For example, subdomain.companyname.de versus
subdomain.companyname.com.au or subdomain.companyname.co.uk.
You end up needing a table or special code to say
how to treat various countries.

Oct 2 '06 #2

Tim Chase

> > Here, domain name doesn't contain subdomain, or should I
> > say, domain's part of 'www', mail, news and en should be
> > excluded.
>
> It's a little more complicated, you have to treat co.uk about
> the same way as .com, and similarly for some other countries
> but not all. For example, subdomain.companyname.de versus
> subdomain.companyname.com.au or subdomain.companyname.co.uk.
> You end up needing a table or special code to say how to treat
> various countries.

In addition, you get very different results even on just "base"
domain-name, such as "whitehouse" based on whether you use the
".gov" or ".com" variant of the TLD. Thus, I'm not sure there's
any way to discern this example from the "yahoo.com" vs.
"yahoo.co.uk" variant without doing a boatload of WHOIS queries,
which in turn might be misleading anyways.

A first-pass solution might look something like:

##############################################################
>>> sites
['http://mail.google.com', 'http://reader.google.com',
 'http://mail.yahoo.co.uk', 'http://google.com',
 'http://mail.yahoo.com']
>>> # split('://', 1)[-1] drops the scheme; lstrip('http://') would
>>> # instead strip any leading run of the characters h, t, p, : and /
>>> sitebits = [site.lower().split('://', 1)[-1].split('.')
...             for site in sites]
>>> for site in sitebits: site.reverse()
...
>>> sorted(sitebits)
[['com', 'google'], ['com', 'google', 'mail'], ['com', 'google',
'reader'], ['com', 'yahoo', 'mail'], ['uk', 'co', 'yahoo', 'mail']]
>>> results = ['http://' + ('.'.join(reversed(site)))
...            for site in sorted(sitebits)]
>>> results
['http://google.com', 'http://mail.google.com',
 'http://reader.google.com', 'http://mail.yahoo.com',
 'http://mail.yahoo.co.uk']
##############################################################

which can be wrapped up like this:

##############################################################
>>> def sort_by_domain(sites):
...     # drop the scheme, then split the host into its dot-separated labels
...     sitebits = [site.lower().split('://', 1)[-1].split('.')
...                 for site in sites]
...     for site in sitebits: site.reverse()
...     return ['http://' + ('.'.join(reversed(site)))
...             for site in sorted(sitebits)]
...
>>> sort_by_domain(sites)
['http://google.com', 'http://mail.google.com',
 'http://reader.google.com', 'http://mail.yahoo.com',
 'http://mail.yahoo.co.uk']
##############################################################

to give you a sorting function. It assumes http rather than
having mixed url-types, such as ftp or mailto. They're easy
enough to strip off as well, but putting them back on becomes a
little more exercise.

Just a few ideas,

-tkc

Oct 2 '06 #3

gene tani

Paul Rubin wrote:

> "js " <eb*****@gmail.com> writes:
> > Here, domain name doesn't contain subdomain,
> > or should I say, domain's part of 'www', mail, news and en should be
> > excluded.
>
> It's a little more complicated, you have to treat co.uk about
> the same way as .com, and similarly for some other countries
> but not all. For example, subdomain.companyname.de versus
> subdomain.companyname.com.au or subdomain.companyname.co.uk.
> You end up needing a table or special code to say
> how to treat various countries.

Plus, how do you order "https:", "ftp", URLs with "www.", "www2.",
named anchors, etc.?

Gentle reminder: is this homework? And you can expect better responses
if you show you've bootstrapped yourself on the problem to some extent.

Oct 2 '06 #4

Thanks for your quick reply.
Yeah, it's a hard task, and unfortunately even Google doesn't help me much.

All I want to do is sort a list of URLs by company name,
like oreilly, ask, skype, amazon, google and so on, to find out
how many companies' URLs the list contains.

Oct 2 '06 #5

bearophileHUGS

Tim Chase:

> to give you a sorting function. It assumes http rather than
> having mixed url-types, such as ftp or mailto. They're easy
> enough to strip off as well, but putting them back on becomes a
> little more exercise.

With a modern Python you don't need to do all that work, you can do:

    sorted(urls, key=cleaner)

where cleaner is a function that finds the important part of each
string you have to sort.
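
A minimal sketch of what such a key function might look like, assuming Python 3 and ordinary http/https/ftp URLs. The name cleaner comes from the line above; its body and the sample list are illustrative assumptions, not part of the original post:

```python
from urllib.parse import urlsplit

def cleaner(url):
    """Sort key: the host's labels in reverse order, e.g. ['com', 'google', 'mail']."""
    host = urlsplit(url).hostname or ""   # hostname is None when the URL has no host
    return host.lower().split(".")[::-1]

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

ordered = sorted(urls, key=cleaner)
for url in ordered:
    print(url)
```

Because sorted() only uses the key for comparison, the original URLs come back untouched, so there is no scheme to strip off and put back on.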

Bye,
bearophile

Oct 2 '06 #6

bearophileHUGS

js:

> All I want to do is to sort out a list of url by companyname,
> like oreilly, ask, skype, amazon, google and so on, to find out
> how many company's url the list contain.

Then if you can define a good enough list of such company names, you
can just search for those names inside each URL.
Maybe you can use a string method, or an RE, or create a big string with
all the company names and perform a longest common subsequence search
using the stdlib function.
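
A rough sketch of the simplest variant, assuming Python 3 and a hand-made list of names. COMPANY_NAMES and company_of are hypothetical, and the code compares whole host labels instead of searching for arbitrary substrings, so that a short name like "ask" does not match inside an unrelated host:

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical list of known company names; a real one would be much longer.
COMPANY_NAMES = ["google", "yahoo", "oreilly", "ask"]

def company_of(url, names=COMPANY_NAMES):
    """Return the first known company name that appears as a host label, else None."""
    labels = (urlsplit(url).hostname or "").split(".")
    for name in names:
        if name in labels:
            return name
    return None

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

# Count how many URLs belong to each company.
counts = Counter(company_of(u) for u in urls)
print(counts)
```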

Bye,
bearophile

Oct 2 '06 #7

jay graves

gene tani wrote:

> Plus, how do you order "https:", "ftp", URLs with "www.", "www2." ,
> named anchors etc?

Now is a good time to point out the urlparse module in the standard
library. It will help the OP with all of this stuff.
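
In Python 3 the same functionality lives in urllib.parse. A small sketch of pulling a URL apart with urlsplit; the example URLs and the helper name host_of are made up for illustration:

```python
from urllib.parse import urlsplit

def host_of(url):
    """Return the lowercased host of a URL, or '' if it has none."""
    # .hostname leaves out any user info and port
    return urlsplit(url).hostname or ""

parts = urlsplit("https://www2.example.com/docs/index.html#install")
print(parts.scheme)    # https
print(parts.hostname)  # www2.example.com
print(parts.path)      # /docs/index.html
print(parts.fragment)  # install

print(host_of("ftp://ftp.example.org/pub/"))  # ftp.example.org
```

With the scheme, host and named anchor separated like this, the sort key can be built from the host alone.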

just adding my 2 cents.

....
jay graves

Oct 2 '06 #8

Paul Rubin

"js " <eb*****@gmail.com> writes:

> All I want to do is to sort out a list of url by companyname,
> like oreilly, ask, skype, amazon, google and so on, to find out
> how many company's url the list contain.

Here's a function I used to use. It makes no attempt to be
exhaustive, but did a reasonable job on the domains I cared about at
the time:

import re

def host_domain(hostname):
    parts = hostname.split('.')
    if parts[-1] in ('au', 'uk', 'nz', 'za', 'jp', 'br'):
        # www.foobar.co.uk, etc
        host_len = 3
    elif len(parts) == 4 and re.match(r'^[\d.]+$', hostname):
        host_len = 4    # 2.3.4.5 numeric address
    else:
        host_len = 2
    d = '.'.join(parts[-host_len:])
    # print('host_domain:', hostname, '=>', d)
    return d

Oct 2 '06 #9

Paul McGuire

"js " <eb*****@gmail.com> wrote in message
news:ma***************************************@python.org...

> Hi list,
>
> I have a list of URL and I want to sort that list by the domain name.
>
> Here, domain name doesn't contain subdomain,
> or should I say, domain's part of 'www', mail, news and en should be
> excluded.
>
> For example, if the list was the following
> ------------------------------------------------------------
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://google.com
> http://mail.yahoo.com
> ------------------------------------------------------------
>
> the sort's output would be
> ------------------------------------------------------------
> http://google.com
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://mail.yahoo.com
> ------------------------------------------------------------
>
> As you can see above, I don't want to
> Thanks in advance.

How about sorting the strings as they are reversed?

urls = """\
http://mail.google.com
http://reader.google.com
http://mail.yahoo.co.uk
http://google.com
http://mail.yahoo.com""".split("\n")

sortedList = [ su[1] for su in sorted([ (u[::-1],u) for u in urls ]) ]

for url in sortedList:
    print(url)

Prints:

http://mail.yahoo.co.uk
http://mail.google.com
http://reader.google.com
http://google.com
http://mail.yahoo.com

Close to what you are looking for, might be good enough?
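
The same reversed-string ordering can also be written with the key argument of sorted(), which avoids building the (reversed, original) tuples by hand. A small sketch, assuming Python 2.4 or later; sorted_urls is just an illustrative name:

```python
urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

# Compare each URL by its characters read right to left.
sorted_urls = sorted(urls, key=lambda u: u[::-1])

for url in sorted_urls:
    print(url)
```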

-- Paul

Oct 2 '06 #10

> Gentle reminder: is this homework? And you can expect better responses
> if you show you've bootstrapped yourself on the problem to some extent.

Sure thing.
First I tried to solve this by using a list of domains found at
http://www.neuhaus.com/domaincheck/domain_list.htm

I converted this to a list (in Python) and tried something like the following:

look for URLs that end with one of the domains in the list
if found:
    capture the part to the left of the domain (TLD) and
    save each URL to a dictionary keyed by the captured string

To me this seems to work, but I'm stuck because the solution doesn't seem very good.
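
One way the steps above might be sketched in Python 3. SUFFIXES is a tiny hypothetical stand-in for the full domain list, and company_name is an invented helper name. The longest suffix is tried first so that a host ending in .co.uk is not matched by .uk and filed under "co":

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical, deliberately incomplete suffix table.
SUFFIXES = [".com", ".de", ".uk", ".co.uk", ".com.au"]

def company_name(url, suffixes=SUFFIXES):
    """Return the label just left of the longest matching suffix, or None."""
    host = urlsplit(url).hostname or ""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if host.endswith(suffix):
            # 'mail.yahoo.co.uk' -> 'mail.yahoo' -> 'yahoo'
            return host[:-len(suffix)].split(".")[-1]
    return None

urls = [
    "http://mail.google.com",
    "http://reader.google.com",
    "http://mail.yahoo.co.uk",
    "http://google.com",
    "http://mail.yahoo.com",
]

# Dictionary keyed by the captured company name.
by_company = defaultdict(list)
for url in urls:
    by_company[company_name(url)].append(url)

for name in sorted(by_company):
    print(name, len(by_company[name]))
```

The quality of the result depends entirely on how complete the suffix table is, which is the difficulty raised earlier in the thread.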

Oct 2 '06 #11

On 2 Oct 2006 08:56:09 -0700, be************@lycos.com
<be************@lycos.com> wrote:

> js:
> > All I want to do is to sort out a list of url by companyname,
> > like oreilly, ask, skype, amazon, google and so on, to find out
> > how many company's url the list contain.
>
> Then if you can define a good enough list of such company names, you
> can just do a search of such names inside each url.
> Maybe you can use string method, or a RE, or create a big string with
> all the company names and perform a longest common subsequence search
> using the stdlib function.

Well, I think the list is so large that it's impossible to
create such a good company list.

Oct 2 '06 #12

> How about sorting the strings as they are reversed?
>
> urls = """\
> http://mail.google.com
> http://reader.google.com
> http://mail.yahoo.co.uk
> http://google.com
> http://mail.yahoo.com""".split("\n")
>
> sortedList = [ su[1] for su in sorted([ (u[::-1],u) for u in urls ]) ]
>
> for url in sortedList:
>     print(url)
>
> <snip>
>
> Close to what you are looking for, might be good enough?

Great... I wouldn't have thought of it that way. Thanks a lot!

Oct 2 '06 #13
