how to strip the domain name in python?

Hi,

I have a list of URLs like the ones below, and I am trying to strip out the
domain name using the following code:

http://www.cnn.com
www.yahoo.com
http://www.ebay.co.uk

pattern = re.compile("http:\\\\(.*)\.(.*)", re.S)
match = re.findall(pattern, line)

if (match):
    s1, s2 = match[0]

    print s2

But none of the sites matched. Can you please tell me what I am
missing?

Thank you.

Apr 14 '07 #1
You're using backslashes in your RE pattern, to start with, while
the URLs contain forward slashes (or no slashes at all, in the case
of the second one).

Anyway, forget REs and use the standard library module urlparse,
specifically its urlparse.urlsplit function.
Alex
Apr 14 '07 #2
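
A minimal sketch of the urlsplit suggestion, written for Python 3, where the function lives in urllib.parse (the thread's urlparse module is the Python 2 name). The helper name host_of is made up for illustration. It also shows what happens to a URL that has no scheme, like the second one in the list: urlsplit puts the whole string in the path and leaves the network location empty.

```python
from urllib.parse import urlsplit  # Python 3 location of urlsplit

def host_of(url):
    """Return the network-location part of a URL (hypothetical helper)."""
    return urlsplit(url).netloc

urls = ["http://www.cnn.com", "www.yahoo.com", "http://www.ebay.co.uk"]
for url in urls:
    parts = urlsplit(url)
    # Without "http://", the host name lands in parts.path instead.
    print(repr(parts.netloc), repr(parts.path))
```

So any solution built on urlsplit probably needs a fallback for scheme-less input.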

change re.compile("http:\\\\(.*)\.(.*)", re.S) to re.compile("http:\/\/(.*)\.(.*)", re.S)

Apr 14 '07 #3
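
To see why the original pattern never matched, here is a small Python 3 sketch. The variable names are illustrative only. In a normal string literal, "\\\\" is two backslash characters, which the regex engine reads as one literal backslash, so the pattern looks for "http:" followed by a backslash. The forward slash needs no escaping in Python, so a raw string with plain "//" is enough.

```python
import re

line = "http://www.cnn.com"

# Original idea: "\\\\" becomes the regex \\ which matches ONE literal backslash.
# ("\\." is used here instead of "\." to avoid an invalid-escape warning.)
backslash_pattern = re.compile("http:\\\\(.*)\\.(.*)", re.S)

# Raw string with forward slashes, equivalent to the suggested fix.
slash_pattern = re.compile(r"http://(.*)\.(.*)", re.S)

print(backslash_pattern.findall(line))  # nothing matches
print(slash_pattern.findall(line))      # one match, split at the LAST dot
```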
Thanks. I tried this, but when 'line' is http://www.cnn.com, I get 'com' for 's2'.
I want 'cnn.com' (everything after the first '.'). How can I do
that?

pattern = re.compile("http:\/\/(.*)\.(.*)", re.S)
match = re.findall(pattern, line)

if (match):
    s1, s2 = match[0]

    print s2

Apr 14 '07 #4
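
A likely explanation for getting 'com': the first (.*) is greedy, so it swallows as much as it can and leaves only the text after the last dot for the second group. One possible fix, sketched here in Python 3 with illustrative names, is to make the first group non-greedy with (.*?) so it stops at the first dot.

```python
import re

line = "http://www.cnn.com"

greedy = re.compile(r"http://(.*)\.(.*)", re.S)   # first group runs to the last dot
lazy = re.compile(r"http://(.*?)\.(.*)", re.S)    # first group stops at the first dot

print(greedy.findall(line))
print(lazy.findall(line))
print(lazy.findall("http://www.ebay.co.uk"))
```

Note that neither pattern matches www.yahoo.com, because it has no "http://" prefix.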
Can anyone please help me with my problem? I still can't solve it.

Basically, I want to keep only the text after the first '.' in the URL:

http://www.cnn.com -> cnn.com

Apr 15 '07 #5
from urlparse import urlsplit

def get_domain(url):
    net_location = urlsplit(url)[1]
    return '.'.join(net_location.rsplit('.', 2)[-2:])

def main():
    print get_domain('http://www.cnn.com')

Ciao,
Marc 'BlackJack' Rintsch
Apr 15 '07 #6
Thanks for your help.

But if the input string is "http://www.ebay.co.uk/", I only get
"co.uk".

How can I change it so that it works for both www.ebay.co.uk and www.cnn.com?

Apr 15 '07 #7
>>> from urlparse import urlsplit
>>> def get_domain(url):
...     net_location = urlsplit(url)[1]
...     return net_location.split(".", 1)[1]
...
>>> print get_domain('http://www.cnn.com')
cnn.com
>>> print get_domain('http://www.ebay.co.uk')
ebay.co.uk
>>>
regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC/Ltd http://www.holdenweb.com
Skype: holdenweb http://del.icio.us/steve.holden
Recent Ramblings http://holdenweb.blogspot.com

Apr 16 '07 #8
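
For readers on Python 3, the same idea might look like the sketch below. The function name drop_first_label is made up here, and the code assumes every host starts with a label you want to discard, such as "www". When that assumption fails, a real part of the domain is lost.

```python
from urllib.parse import urlsplit

def drop_first_label(url):
    """Return the host with everything up to the first '.' removed."""
    net_location = urlsplit(url).netloc
    return net_location.split(".", 1)[1]

print(drop_first_label("http://www.cnn.com"))
print(drop_first_label("http://www.ebay.co.uk"))
# Caveat: with no "www" prefix, the first real label is dropped instead.
print(drop_first_label("http://cnn.com"))
```

A URL without a scheme, such as www.yahoo.com, has an empty network location, so this version raises IndexError for it.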

from urlparse import urlsplit

def get_domain(url):
    net_location = (
        urlsplit(url)[1]
        and urlsplit(url)[1].split('.')
        or urlsplit(url)[2].split('.')
    )  # tricksy way to get long line into email
    if net_location[0].lower() == 'www':
        net_location = net_location[1:]
    return '.'.join(net_location)

def main():
    testItems = ['http://www.cnn.com',
                 'www.yahoo.com',
                 'http://www.ebay.co.uk']

    for testItem in testItems:
        print get_domain(testItem)

if __name__ == '__main__':
    main()
Apr 16 '07 #9
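
A rough Python 3 port of the function above, as a sketch rather than a definitive version. It keeps the same approach: use the network location when urlsplit finds one, fall back to the path for scheme-less input, and strip only a leading "www" label. The import path urllib.parse and the print() calls are the Python 3 changes; the variable names are otherwise illustrative.

```python
from urllib.parse import urlsplit

def get_domain(url):
    parts = urlsplit(url)
    # Fall back to the path when there is no scheme, e.g. "www.yahoo.com".
    host = parts.netloc or parts.path
    labels = host.split(".")
    if labels[0].lower() == "www":
        labels = labels[1:]
    return ".".join(labels)

for item in ["http://www.cnn.com", "www.yahoo.com", "http://www.ebay.co.uk"]:
    print(get_domain(item))
```

Because only "www" is removed, a host such as cnn.com passes through unchanged. One limitation of the path fallback: scheme-less input that carries a path (for example www.yahoo.com/news) would keep that path in the result.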
