
How to find un-referenced web pages, images and files in a web page directory tree?

I have a directory tree on my hard disc that mirrors all the web pages and
linked files on my web host's server.

All web pages and files are statically linked, so dynamically composed links
(e.g. built with JavaScript) do not matter here.

Now I want to find out which of all these (many) files are un-referenced
orphans, starting from the main page index.html (or index.shtml).

In other words: can a file, e.g. aaa.log, be reached through a chain of links
like

index.html -> subpage8.html -> details2345.html -> aaa.log

Is there a tool which helps me to find all these un-referenced webpages and
files? Of course without doing a manual code review :-)

Keep in mind that the static link URLs can be absolute
(http://www.mywebpages.com/content/subpage8.html) or relative
(content/subpage8.html).

Pat
Dec 6 '07 #1
On Dec 6, 11:15 am, pat...@hotmail.com (Patricia Mindanao) wrote:
> Now I want to find out which of all these (many) files are un-referenced
> orphans, starting from the main page index.html (or index.shtml).
You will have to write a small script, or find one on the net, that will:
1) select your starting file (index.html, if you want to check which pages
are reachable from your index page),
2) read the selected file,
3) scan the file, extract all links, work out which files they refer to,
check that those files exist, and flag them as used,
4) take each file found in step 3 and repeat from step 2 for each of them,
continuing until there are no more files to check. A sketch of this loop in
shell follows.
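A minimal sketch of that loop in Bourne shell (my own illustration, not a
tested tool; it assumes GNU grep's -o option, links written as paths relative
to the site root, and file names without spaces):

#!/bin/sh
# breadth-first walk over local href/src links, starting at index.html

queue="index.html"
seen=""

while [ -n "$queue" ]; do
    set -- $queue                 # split the queue into words
    [ $# -eq 0 ] && break
    page=$1; shift; queue="$*"

    case " $seen " in *" $page "*) continue ;; esac   # already visited
    seen="$seen $page"
    [ -f "$page" ] || continue    # dangling link: nothing to scan

    # pull the targets out of href="..." and src="..." attributes
    links=`grep -o 'href="[^"]*"\|src="[^"]*"' "$page" | sed 's/^[a-z]*="//;s/"$//'`
    queue="$queue $links"
done

# everything in $seen is reachable; compare against the full file list
for f in $seen; do echo "$f"; done | sort > reachable.txt
find . -type f | sed 's|^\./||' | sort > allfiles.txt
comm -13 reachable.txt allfiles.txt   # only in allfiles.txt = orphans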
Dec 6 '07 #2
In article <47***********************@newsspool1.arcor-online.net>,
pa****@hotmail.com (Patricia Mindanao) wrote:
> Is there a tool which helps me to find all these un-referenced webpages
> and files? Of course without doing a manual code review :-)
Use any basic spider (e.g., wget) followed by any recursive file
comparison (e.g., diff).
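For example (a hypothetical invocation; adjust the URL and the directory
names to your site -- "local-tree" here stands for your mirrored copy):

# mirror the live site into ./spidered, following links from the start page
wget --mirror --no-host-directories --directory-prefix=spidered http://www.mywebpages.com/

# files present in your local tree but missing from the spidered copy were
# never reached by any link, so they are orphan candidates
diff -rq local-tree spidered | grep '^Only in local-tree'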

--
My personal UDP list: 127.0.0.1, 4ax.com, buzzardnews.com, googlegroups.com,
heapnode.com, localhost, ntli.net, teranews.com, vif.com, x-privat.org
Dec 6 '07 #3
On 06 Dec 2007 11:15:12 GMT, in comp.infosystems.www.authoring.html
pa****@hotmail.com (Patricia Mindanao) wrote:
> Is there a tool which helps me to find all these un-referenced webpages
> and files? Of course without doing a manual code review :-)
Xenu - http://home.snafu.de/tilman/xenulink.html

Also checks external (off-site) links.
--
William Hughes, San Antonio, Texas: cv****@grandecom.net
The Carrier Project: http://home.grandecom.net/~cvproj/carrier.htm
Support Project Valour-IT: http://soldiersangels.org/valour/index.html
Dec 7 '07 #4
On Dec 6, 4:15 am, pat...@hotmail.com (Patricia Mindanao) wrote:
> All web pages and files are statically linked, so dynamically composed
> links (e.g. built with JavaScript) do not matter here.
On any Linux server (most websites are served this way, no... Linux/Apache?)

############
Create this file and call it "ifgrep". If you don't put it in your current
working directory, make sure it is in your execution path. Then run
chmod +x ifgrep to make it executable.

the ifgrep file:

#!/bin/sh
# print the name of file $2 if it contains the string $1
if grep -q "$1" "$2"; then
    echo "$2"
fi
############
Create this file and call it "peeper", then run chmod +x peeper.

the peeper file:

#!/bin/sh
# for each file named in htmlfileExistsList, print its base name once if
# any HTML file in the tree mentions it (we echo the searched-for name,
# not the searching file, so the two lists compare like-for-like)
for x in `cat htmlfileExistsList`
do
    look=`basename "$x"`
    for lookee in `find . -type f -name "*html"`
    do
        found=`/home/sandy/bin/ifgrep "$look" "$lookee"`
        if [ -n "$found" ]; then
            echo "$look"
            break
        fi
    done
done
###############

from the terminal prompt:

find . -name "*html" > htmlfileExistsList
./peeper > htmlfileReferencedInAnHtmlFileList

################
Now you have two text files:
htmlfileExistsList and htmlfileReferencedInAnHtmlFileList.
Any base name that appears in the first list but not in the second is an
orphaned HTML file; see the comparison sketch below.
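One hypothetical way to compute that difference (assumes both lists hold one
name per line; sed(1), sort(1) and comm(1) are standard):

# reduce the exists list to base names so the two lists compare cleanly
sed 's|.*/||' htmlfileExistsList | sort -u > exists.sorted
sort -u htmlfileReferencedInAnHtmlFileList > referenced.sorted
comm -23 exists.sorted referenced.sorted   # names in no HTML file = orphans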

Dec 7 '07 #5
On Dec 7, 7:50 am, salmobytes <Sandy.Pittendr...@gmail.com> wrote:
> On any Linux server (most websites are served this way, no... Linux/Apache?)
...Apache and/or Tomcat, that is.
code snipped:
> /home/sandy/bin/ifgrep "$look" "$lookee"
You won't have installed ifgrep at this path on your server. But it is
important to use the full path to the ifgrep file (wherever it is), because
on most systems the shell won't have that file in its execution path. So use
the full path to ifgrep in the peeper script.
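For instance (hypothetical locations; substitute wherever you actually saved
the script):

# see where (if anywhere) the shell finds ifgrep
command -v ifgrep || echo "ifgrep is not in PATH"
# then hard-code that location in peeper, e.g.
found=`$HOME/bin/ifgrep "$look" "$lookee"`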
Dec 7 '07 #6
On 12/6/2007 3:15 AM, Patricia Mindanao wrote:
> In other words: can a file, e.g. aaa.log, be reached through a chain of
> links like index.html -> subpage8.html -> details2345.html -> aaa.log?
>
> Keep in mind that the static link URLs can be absolute
> (http://www.mywebpages.com/content/subpage8.html) or relative
> (content/subpage8.html).
Using your example, use a search tool (e.g., Search on Windows, grep on
UNIX) to search the directory for all files of the form *.html, first
for the string href="aaa.log" and second for the string
href="http://www.mywebpages.com/content/aaa.log". I do this often but
not often enough to create a search script.
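With GNU grep that boils down to something like this (a hypothetical pair of
commands; adjust the file name and domain to your own site):

# relative references to aaa.log anywhere in the HTML files
grep -rl 'href="aaa.log"' --include='*.html' .
# absolute references
grep -rl 'href="http://www.mywebpages.com/content/aaa.log"' --include='*.html' .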

--
David E. Ross
<http://www.rossde.com/>

Natural foods can be harmful: Look at all the
people who die of natural causes.
Dec 8 '07 #7
Patricia Mindanao wrote:
> Is there a tool which helps me to find all these un-referenced webpages
> and files? Of course without doing a manual code review :-)
linklint <URL:http://www.linklint.org/> can determine orphans and supports
both local-file and HTTP site checking.
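An invocation from memory -- double-check the options against the linklint
documentation before relying on it:

# check the whole local tree rooted at ./site and list orphan files
linklint -root ./site -doc ./linklint-report -orphan /@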

--
Klaus Johannes Rusch
Kl********@atmedia.net
http://www.atmedia.net/KlausRusch/
Dec 17 '07 #8
