Bytes IT Community

Not able to read an HTTPS page from Python

Hi all,

I am trying to read e-mail IDs that appear on a page as links (clicking
one opens Outlook with the corresponding e-mail ID).

These links are on an HTTPS page, that is, a secured HTTP page.

The problem is that I can read such links from an HTTP page, but not
when I try the same thing with HTTPS.

I am able to read the links using the following code, based on sgmllib:

import sgmllib

DEBUG = False      # set to True to trace every <a> tag
mailIdList = []    # collected e-mail addresses

class MyParser(sgmllib.SGMLParser):

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.inside_a = False
        self.address = ''
        self.nickname = ''   # link text; initialised so end_a never reads an unset attribute

    def start_a(self, attrs):
        if DEBUG:
            print "start_a"
            print attrs
        for attr, value in attrs:
            if attr == 'href' and value.startswith('mailto:'):
                self.address = value[7:]   # drop the leading 'mailto:'
        self.inside_a = True

    def end_a(self):
        if DEBUG:
            print "end_a"
        if self.address:
            print '"%s" <%s>' % (self.nickname, self.address)
            mailIdList.append(self.address)
        self.inside_a = False
        self.address = self.nickname = ''

    def handle_data(self, data):
        # remember the text between <a> and </a> as the display name
        if self.inside_a:
            self.nickname = data

For the proxy authentication and the HTTPS handler I am using the
following lines of code:

import urllib2

authinfo = urllib2.HTTPBasicAuthHandler()
proxy_support = urllib2.ProxyHandler({"http": "http://user:password@proxyname:port"})
opener = urllib2.build_opener(proxy_support, authinfo, urllib2.HTTPSHandler)
urllib2.install_opener(opener)
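
One detail worth checking in the handler setup above: the ProxyHandler mapping names only the "http" scheme, so requests for https URLs are not routed through that proxy. Whether this is what goes wrong here is only an assumption. The following is a minimal sketch in Python 3 (where urllib2 became urllib.request), not the original code; the helper name build_proxy_opener and the proxy URL are hypothetical placeholders.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that routes both http and https requests via proxy_url."""
    # ProxyHandler maps URL scheme -> proxy URL. A scheme that is missing
    # from this explicit mapping is not sent through the proxy.
    proxy_support = urllib.request.ProxyHandler({
        "http": proxy_url,
        "https": proxy_url,
    })
    # build_opener adds the default handlers itself, including the HTTPS
    # handler when the ssl module is available.
    return urllib.request.build_opener(proxy_support)


# Placeholder credentials and host; nothing is contacted until
# opener.open(...) is called.
opener = build_proxy_opener("http://user:password@proxy.example:8080")
```

The opener can then be passed to urllib.request.install_opener(), mirroring the install_opener() call above.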

Then I call the parser on a particular HTTPS page, whose URL is given
as a command-line argument, so that it reads all the links on that page.

import sys

p = MyParser()
for ln in urllib2.urlopen(sys.argv[1]):
    p.feed(ln)
p.close()

NOTE: I have also installed Python with _ssl support.

So with this code I am able to read the links from an HTTP page, but
not from the HTTPS page.

AM NOT GETTING ANY ERRORS EITHER BUT ITS NOT READING THE LINKS, THAT
ARE PRESENT IN THE GIVEN HTTPS PAGE

Could you please tell me whether I am doing something wrong in the
above code for any of the handlers?

I have been stuck on this for many days; please help me find a
solution.

Thanks and regards

YOGI

Nov 9 '05 #1
3 Replies


<mu*******@yahoo.com> wrote:
> AM NOT GETTING ANY ERRORS EITHER BUT ITS NOT READING THE LINKS, THAT
> ARE PRESENT IN THE GIVEN HTTPS PAGE


HAVE YOU TRIED ADDING A PRINT STATEMENT TO THE FEED LOOP SO
YOU CAN SEE WHAT YOU'RE GETTING BACK FROM THE SERVER ?
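
One way to act on this suggestion, sketched in Python 3 rather than the Python 2 of the original post: summarise the raw response before it reaches the parser, so that an empty body, a login page or a proxy error page stands out. The helper names describe_body and fetch_and_describe are hypothetical, and decoding as UTF-8 is an assumption about the page.

```python
import urllib.request


def describe_body(body, preview=200):
    """Summarise a raw response body (bytes) for quick inspection."""
    # Assumption: UTF-8 is close enough for a diagnostic preview;
    # undecodable bytes are replaced instead of raising.
    text = body.decode("utf-8", errors="replace")
    return {
        "bytes": len(body),
        "mailto_links": text.lower().count("mailto:"),
        "preview": text[:preview],
    }


def fetch_and_describe(url):
    """Fetch url with the installed opener and report what came back."""
    with urllib.request.urlopen(url) as resp:
        info = describe_body(resp.read())
        info["status"] = resp.status        # HTTP status code
        info["final_url"] = resp.geturl()   # differs from url after a redirect
    return info
```

Calling fetch_and_describe(sys.argv[1]) and printing the result shows whether the server returned the expected page at all, before any link parsing is involved.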

</f>

Nov 9 '05 #2

Fredrik Lundh wrote:
> <mu*******@yahoo.com> wrote:
>
>> AM NOT GETTING ANY ERRORS EITHER BUT ITS NOT READING THE LINKS, THAT
>> ARE PRESENT IN THE GIVEN HTTPS PAGE
>
> HAVE YOU TRIED ADDING A PRINT STATEMENT TO THE FEED LOOP SO
> YOU CAN SEE WHAT YOU'RE GETTING BACK FROM THE SERVER ?

COULD YOU GUYS BE QUIET, PLEASE, I'M TRYING TO WORK HERE!

regards
Steve
--
Steve Holden +44 150 684 7255 +1 800 494 3119
Holden Web LLC www.holdenweb.com
PyCon TX 2006 www.python.org/pycon/

Nov 9 '05 #3

It is possible that the links have been obscured (something
I do on my own web pages) by inserting JavaScript that creates
the links on the fly using document.write(). That way web
spiders can't go through the web pages and easily pick up email
addresses to send spam to all my employees. Just a thought,
since you have spent days on this.
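
A rough way to test this idea, sketched in Python 3 with html.parser (sgmllib no longer exists there): scan the raw HTML for static mailto: anchors and, separately, count script blocks that call document.write(). The names MailtoScanner and scan are hypothetical, and a document.write() call is only a hint that links may be generated in the browser, not proof.

```python
from html.parser import HTMLParser


class MailtoScanner(HTMLParser):
    """Collect static mailto: links and count scripts using document.write."""

    def __init__(self):
        super().__init__()
        self.addresses = []       # addresses found in plain <a href="mailto:...">
        self.script_writes = 0    # <script> blocks containing document.write
        self._script_parts = None # text of the script being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._script_parts = []
        elif tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.lower().startswith("mailto:"):
                    self.addresses.append(value[7:])  # drop 'mailto:'

    def handle_data(self, data):
        # Script text may arrive in several chunks, so buffer it.
        if self._script_parts is not None:
            self._script_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._script_parts is not None:
            if "document.write" in "".join(self._script_parts):
                self.script_writes += 1
            self._script_parts = None


def scan(html_text):
    """Return (static mailto addresses, number of document.write scripts)."""
    scanner = MailtoScanner()
    scanner.feed(html_text)
    scanner.close()
    return scanner.addresses, scanner.script_writes
```

If scan() finds no addresses but a non-zero script count on the HTTPS page, the links are probably built by JavaScript, and no HTML parser working on the raw page will see them.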

-Larry Bates

Nov 9 '05 #4
