QUERY: comparing website contents

I've got two websites, one original, the other based off the original.

I'd like to diff the two websites with an automatic comparison tool to see what text has changed. The problem is that the HTML code and layout have changed drastically, so I can't do a straight text-file compare. What I am interested in is purely the raw content (paragraphs, sentences, etc.). The original site has no JavaScript, onmouseover hovers, etc. The revamped website has JavaScript, onmouseover hovers, popups, etc.

How can I create a script (Perl? C++?) that extracts the main text bodies from both sites? I guess I would also have to specify starting and ending delimiters. Once extracted, it would need to convert <p></p> paragraph tags and strip out <a onmouseover...> anchor links (while keeping the words in between the anchor tags, of course). The new website uses two spaces after each full stop while the old website uses one. Will this matter?
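As a rough illustration of the extraction step asked about above, here is a minimal Python sketch built on the standard library's html.parser. The class name TextExtractor, the helper html_to_paragraphs and the choice of which tags count as paragraph breaks are all assumptions made for this example, not something from the thread. It keeps the words inside anchors, drops script and style content, and collapses runs of whitespace, so one space versus two after a full stop should no longer matter.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text as a list of paragraphs (hypothetical helper)."""

    # Assumed set of tags that end a paragraph; adjust for the real pages.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    # Content of these tags is never visible text.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []    # finished paragraphs
        self._para_buf = []     # text pieces of the paragraph being built
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse all whitespace runs (two spaces, tabs, newlines) to one space.
        text = re.sub(r"\s+", " ", "".join(self._para_buf)).strip()
        if text:
            self.paragraphs.append(text)
        self._para_buf = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._flush()
        # Any other tag (e.g. <a onmouseover=...>) is ignored; its text is kept.

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._para_buf.append(data)


def html_to_paragraphs(html):
    """Return the visible text of an HTML string, one entry per paragraph."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # pick up any trailing text outside a block tag
    return parser.paragraphs
```

With this kind of normalisation, an old-style page and a revamped page that carry the same sentence should produce identical paragraph strings even if one wraps a word in a JavaScript-laden anchor.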

Once we have the plain text, how do I wrap the paragraphs at 80 characters per line so that we can easily do file compares?
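For the wrapping step, Python's standard textwrap module is one option. The sketch below assumes the text is already available as a list of paragraph strings; the function name wrap_paragraphs is made up for this example. Long words such as URLs are left unbroken, so a line can exceed the width if a single token is longer than it.

```python
import textwrap


def wrap_paragraphs(paragraphs, width=80):
    """Wrap each paragraph to `width` columns, with a blank line between them."""
    wrapped = []
    for para in paragraphs:
        # Collapse whitespace first so "a.  b" and "a. b" wrap identically.
        para = " ".join(para.split())
        wrapped.append(
            textwrap.fill(para, width=width,
                          break_long_words=False, break_on_hyphens=False)
        )
    return "\n\n".join(wrapped) + "\n"
```

Because both sites go through the same wrapping rules, identical paragraphs end up with identical line breaks, which is what a line-based diff tool needs.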



And please do not suggest copying and pasting the text into Notepad or Word. I said 'website', which means dozens of HTML files (probably hundreds). Plus, I'd like a script to automate the comparison so I can repeat it in future and remind myself of the diffs.
Feb 15 '08 #1
acoder
16,027 Expert Mod 8TB
How can I create a script (Perl? C++?)...
You said it yourself. If you know Perl, I can send this over to the Perl forum. JavaScript can't really do this.
Feb 16 '08 #2
I need a tool to get me the substring between delimiters, then line wrap the result at 79 characters and then diff, for both oldsite/old1.htm and newsite/new1.htm.

As for web crawling, the old site is local and the new site is online. But I'd rather hard-code the URLs in a big list (mapping).

I think I'll use Perl (maybe Python) to:

1. for each item in the mapping list
1.1 download the newsite HTML file
1.2 substring using newsite delimiters on the newsite file
1.3 substring using oldsite delimiters on the oldsite file
1.4 html2txt/hindent both the oldsite and newsite files, line wrap at 79 characters, and put them into 2 separate new folders (diff1, diff2)
1.5 repeat through the mapping list

After that I can use Beyond Compare to compare the diff1 and diff2 folders. Hopefully both corresponding text files will be line wrapped at 79 characters with whitespace reduced to 1 character (eliminating runs of 2 or more consecutive spaces, and tabs). Also maintain carriage returns?
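The plan above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names between, normalise and diff_pair are invented here, the delimiter strings are whatever markers the real pages happen to contain, and the regex-based tag stripping is a crude stand-in for a proper html2txt step. Downloading the new site and writing into the diff1/diff2 folders are left out, so the sketch works on HTML strings already in memory and uses the standard difflib module in place of Beyond Compare.

```python
import difflib
import html
import re
import textwrap


def between(text, start, end):
    """Return the text strictly between the first `start` and the next `end`."""
    i = text.find(start)
    if i == -1:
        raise ValueError(f"start delimiter not found: {start!r}")
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        raise ValueError(f"end delimiter not found: {end!r}")
    return text[i:j]


def normalise(fragment, width=79):
    """Turn an HTML fragment into wrapped plain-text lines (crude sketch)."""
    # Reduce every whitespace run (spaces, tabs, newlines) to a single space.
    fragment = " ".join(fragment.split())
    # Mark paragraph ends with a newline, then drop all remaining tags.
    fragment = re.sub(r"</p\s*>|<br\s*/?>", "\n", fragment, flags=re.IGNORECASE)
    fragment = re.sub(r"<[^>]+>", "", fragment)
    fragment = html.unescape(fragment)
    lines = []
    for para in fragment.split("\n"):
        para = " ".join(para.split())
        if para:
            lines.extend(textwrap.wrap(para, width=width))
            lines.append("")  # blank line between paragraphs
    return lines


def diff_pair(old_html, new_html, old_delims, new_delims):
    """Diff the delimited bodies of one old page and one new page."""
    old_lines = normalise(between(old_html, *old_delims))
    new_lines = normalise(between(new_html, *new_delims))
    return list(difflib.unified_diff(old_lines, new_lines,
                                     fromfile="oldsite", tofile="newsite",
                                     lineterm=""))
```

Looping this over the hard-coded mapping list and saving the output of normalise into two folders would give files that a folder-comparison tool can pair up by name.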
Feb 18 '08 #3
rnd me
427 Expert 256MB
If you just want to compare visible contents (not HTML/JS markup changes), I would think comparing the textContent/innerText of the body tag would be easiest.

I don't understand why this could not be done in JavaScript, but perhaps I misunderstand your question.
Feb 18 '08 #4
acoder
16,027 Expert Mod 8TB
If you just want to compare visible contents (not HTML/JS markup changes), I would think comparing the textContent/innerText of the body tag would be easiest.

I don't understand why this could not be done in JavaScript, but perhaps I misunderstand your question.
Technically, it could be done in JavaScript, but with two domains, some server-side code will have to be involved.
Feb 18 '08 #5
