Pulling stuff out of the Internet Archive (Wayback Machine)

I used this when trying to retrieve the McMillan site. Others might find
it useful...

David

#!/usr/bin/env python

import urlparse
import urllib2
import os
import HTMLParser
import sre

class HTMLLinkScanner(HTMLParser.HTMLParser):
    tags = {'a':'href','img':'src','frame':'src','base':'href'}

    def reset(self):
        self.links = {}
        self.replacements = []
        HTMLParser.HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            checkattrs = self.tags[tag]
            if isinstance(checkattrs, (str, unicode)):
                checkattrs = [checkattrs]
            for attr, value in attrs:
                if attr in checkattrs:
                    if tag != 'base':
                        link = urlparse.urldefrag(value)[0]
                        self.links[link] = True
                    self.replacements.append((self.get_starttag_text(), attr, value))

class MirrorRetriever:
    def __init__(self, archivedir):
        self.archivedir = archivedir
        self.urlmap = {}

    def url2filename(self, url):
        scheme, location, path, query, fragment = urlparse.urlsplit(url)
        if not path or path.endswith('/'):
            path += 'index.html'
        path = os.path.join(*path.split('/'))
        if scheme.lower() != 'http':
            location = os.path.join(scheme, location)
        # ignore query for the meantime
        return os.path.join(self.archivedir, location, path)

    def testinclude(self, url):
        scheme, location, path, query, fragment = urlparse.urlsplit(url)
        if scheme in ('mailto', 'javascript'): return False
        # TODO: add ability to specify site
        # return location.lower() == 'www.mcmillan-inc.com'
        return True

    def ensuredir(self, pathname):
        if not os.path.isdir(pathname):
            self.ensuredir(os.path.dirname(pathname))
            os.mkdir(pathname)

    def retrieveurl(self, url):
        return urllib2.urlopen(url).read()

    def mirror(self, url):
        if url in self.urlmap:
            return
        else:
            filename = self.url2filename(url)
            if not self.testinclude(url):
                return
            print url,'->',filename
            self.urlmap[url] = filename
            # TODO: add an option about re-reading stuff
            if os.path.isfile(filename):
                contents = open(filename, 'r').read()
            else:
                try:
                    contents = self.retrieveurl(url)
                except urllib2.URLError, e:
                    print 'could not retrieve url %s: %s' % (url, e)
                    return
            self.ensuredir(os.path.dirname(filename))
            linkscanner = HTMLLinkScanner()
            try:
                linkscanner.feed(contents)
            except:
                print 'could not parse %s as html' % url
            linkstomirror = []
            for link in linkscanner.links:
                linkurl = urlparse.urljoin(url, link)
                linkstomirror.append(linkurl)
            contents = sre.sub('http://web.archive.org/web/[0-9]{14}/', '', contents)
            for tagtext, attr, link in linkscanner.replacements:
                scheme, location, path, query, fragment = urlparse.urlsplit(link)
                newtext = None
                if tagtext.lower().startswith('<base'):
                    # strip out base references
                    newtext = ''
                elif scheme or location:
                    if not self.testinclude(link): continue
                    linkfilename = self.url2filename(link)
                    newtext = tagtext.replace(link, 'file://%s' % linkfilename)
                elif path.startswith('/'):
                    linkurl = urlparse.urljoin(url, link)
                    linkfilename = self.url2filename(linkurl)
                    newtext = tagtext.replace(link, 'file://%s' % linkfilename)
                if newtext is not None:
                    contents = contents.replace(tagtext, newtext)
            contentsfile = open(filename, 'w')
            contentsfile.write(contents)
            contentsfile.close()
            for linkurl in linkstomirror:
                self.mirror(linkurl)

class WaybackRetriever(MirrorRetriever):
    def __init__(self, archivedir, datestring):
        MirrorRetriever.__init__(self, archivedir)
        self.datestring = datestring

    def retrieveurl(self, url):
        waybackurl = 'http://web.archive.org/web/%s/%s' % (self.datestring, url)
        contents = urllib2.urlopen(waybackurl).read()
        if contents.find("Sorry, we can't find the archived version of this page") != -1:
            raise urllib2.URLError("not in wayback archive")
        # remove the copyrighted javascript from the wayback machine...
        contents = sre.sub('\\<SCRIPT language="Javascript"\\>(.|\r|\n)*(// FILE ARCHIVED ON [0-9]{14} AND RETRIEVED(.|\r|\n)* ON [0-9]{14}[.])(.|\r|\n)*\\</SCRIPT\\>', '\\2', contents)
        # replace the javascript-style comments indicating the retrieval with html comments
        contents = sre.sub('// ((FILE|INTERNET).*)', '<!-- \\1 -->', contents)
        return contents

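if __name__ == '__main__':
    import sys
    # usage: <this script> URL DATESTRING
    # URL is the original (non-archived) address to start from; DATESTRING is
    # the Wayback Machine timestamp to request, such as YYYYMMDDhhmmss.
    # Files are written below the current directory.
    m = WaybackRetriever(os.path.abspath('.'), sys.argv[2])
    m.mirror(sys.argv[1])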
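
For anyone reading this on Python 3, where `urlparse`, `urllib2`, `HTMLParser` and `sre` have been renamed or removed, the network-free parts of the script above might be ported roughly as follows. This is only a sketch, not the original author's code: the names `LinkScanner`, `wayback_url`, `strip_wayback_prefix` and `url_to_filename` are made up for illustration, and the prefix regex escapes the dots, which the original does not.

```python
import os
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urlsplit

# Tag -> attribute that carries a URL (same table as the script above).
LINK_ATTRS = {'a': 'href', 'img': 'src', 'frame': 'src', 'base': 'href'}


class LinkScanner(HTMLParser):
    """Collect link targets (with any #fragment removed) from start tags."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        wanted = LINK_ATTRS.get(tag)
        for attr, value in attrs:
            # value is None for attributes written without a value;
            # <base> is skipped because it is not a page to mirror.
            if attr == wanted and value and tag != 'base':
                self.links.add(urldefrag(value)[0])


def wayback_url(datestring, url):
    """Build the archive address for url at the given timestamp."""
    return 'http://web.archive.org/web/%s/%s' % (datestring, url)


def strip_wayback_prefix(contents):
    """Turn archive-rewritten links back into the original addresses."""
    return re.sub(r'http://web\.archive\.org/web/[0-9]{14}/', '', contents)


def url_to_filename(archivedir, url):
    """Map a URL to a local path; the query string is ignored."""
    scheme, location, path, _query, _fragment = urlsplit(url)
    if not path or path.endswith('/'):
        path += 'index.html'
    parts = [p for p in path.split('/') if p]
    if scheme.lower() != 'http':
        location = os.path.join(scheme, location)
    return os.path.join(archivedir, location, *parts)
```

The actual download would go through `urllib.request.urlopen`, which returns bytes in Python 3, so the contents would need decoding before being fed to the parser.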
Jul 18 '05 #1
David Fraser <da****@sjsoft.com> writes:
> I used this when trying to retrieve the McMillan site. Others might
> find it useful...

Cool, thanks. I've done stuff like that a bunch of times, but I
usually just examine the HTML manually and identify a few fixed
strings to search for, to locate the links I want.
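
The fixed-string approach described here could look something like the sketch below, written for Python 3. The helper name `extract_between`, the sample page and the marker strings are all hypothetical; in practice the markers would be whatever text happens to surround the wanted links on the page being scraped.

```python
def extract_between(html, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = html.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = html.find(end_marker, start)
        if end == -1:
            break
        results.append(html[start:end])
        pos = end + len(end_marker)
    return results


# Hypothetical page fragment: the wanted links all sit in cells of class "dl".
page = ('<td class="dl"><a href="/files/a.zip">a</a></td>'
        '<td class="dl"><a href="/files/b.zip">b</a></td>')
links = extract_between(page, '<td class="dl"><a href="', '"')
print(links)
```

This is quicker to write than a parser-based scanner but breaks as soon as the page markup changes, which is the trade-off against the `HTMLParser` approach in the original post.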
Jul 18 '05 #2
