web page text extractor

Hello,

For a project, I need to develop a corpus of online news stories. I'm
looking for an application that, given the URL of a web page, "copies"
the rendered text of the web page (not the source HTML text), opens a
text editor (Notepad), and displays the copied text for the user to
examine and save into a text file. Graphics and sidebars are to be
ignored. The examples I have come across are much too complex for me
to customize for this simple job. Can anyone point me in the right
direction?

Thanks,
gk

Jul 12 '07 #1
Hello jk,
> For a project, I need to develop a corpus of online news stories. [...]
Going simple :)

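from os import system
from sys import argv

OUTFILE = "geturl.txt"
# "lynx -dump" prints the rendered page text to stdout, so redirect it to a file
system("lynx -dump %s > %s" % (argv[1], OUTFILE))
# Open the saved text in Notepad for review (Windows only)
system("start notepad %s" % OUTFILE)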
(You can find lynx at http://lynx.browser.org/)

Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.

HTH,
--
Miki <mi*********@gmail.com>
http://pythonwise.blogspot.com

Jul 12 '07 #2
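
A slightly more defensive variant of the Lynx approach above, sketched for Python 3. It hands the URL to lynx as an argument list instead of a shell string, so characters such as & in the URL are not interpreted by the shell. The function names, the output file name default, and the use of -nolist (which leaves out the numbered link list lynx normally appends) are choices made here for illustration, not part of the original post. It assumes lynx is on PATH.

```python
import shutil
import subprocess

OUTFILE = "geturl.txt"


def lynx_command(url):
    """Build the argument list for dumping a page's rendered text."""
    # -dump prints the formatted page to stdout; -nolist omits the link list.
    return ["lynx", "-dump", "-nolist", url]


def dump_page(url, outfile=OUTFILE):
    """Write the rendered text of `url` to `outfile` using lynx."""
    if shutil.which("lynx") is None:
        raise RuntimeError("lynx was not found on PATH")
    with open(outfile, "wb") as out:
        # No shell is involved, so the URL is passed through untouched.
        subprocess.run(lynx_command(url), stdout=out, check=True)
    return outfile
```

On Windows, the saved file could then be opened for review with subprocess.run(["notepad", dump_page(url)]).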
On 2007-07-12 04:42:25 -0500, kublai <re*******@gmail.com> said:
> For a project, I need to develop a corpus of online news stories. [...]
You may find BeautifulSoup or templatemaker to be of assistance:

http://www.crummy.com/software/BeautifulSoup/
http://www.holovaty.com/blog/archive/2007/07/06/0128

Jul 12 '07 #3
2007/7/12, kublai <re*******@gmail.com>:
> For a project, I need to develop a corpus of online news stories. [...]
def textonly(url):
    # Get the HTML source on url and give only the main text
    f = urllib2.urlopen(url)
    text = f.read()
    r = re.compile('\<[^\<\>]*\>')
    newtext = r.sub('',text)
    while newtext != text:
        text = newtext
        newtext = r.sub('',text)
    return text

--
Andre Engels, an*********@gmail.com
ICQ: 6260644 -- Skype: a_engels
Jul 12 '07 #4
2007/7/12, Andre Engels <an*********@gmail.com>:

I forgot to include

import urllib2, re

here
def textonly(url):
    # Get the HTML source on url and give only the main text
    f = urllib2.urlopen(url)
    text = f.read()
    r = re.compile('\<[^\<\>]*\>')
    newtext = r.sub('',text)
    while newtext != text:
        text = newtext
        newtext = r.sub('',text)
    return text

--
Andre Engels, an*********@gmail.com
ICQ: 6260644 -- Skype: a_engels
Jul 12 '07 #5
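
The textonly() function above targets Python 2 (urllib2). A rough Python 3 port might look like the sketch below. Splitting the tag stripping into its own strip_tags() helper, the TAG name, and the UTF-8 fallback for pages that declare no charset are assumptions added here, not part of the original post.

```python
import re
import urllib.request

# Anything between "<" and ">" that contains no other angle bracket.
TAG = re.compile(r"<[^<>]*>")


def strip_tags(html):
    """Remove everything that looks like <...>, repeating until stable."""
    previous = None
    while previous != html:
        previous = html
        html = TAG.sub("", html)
    return html


def textonly(url):
    """Fetch `url` and return its source with the tags stripped."""
    with urllib.request.urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return strip_tags(html)
```

Like the original, this removes only the tags themselves; whatever sits between them is kept as-is.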
On Jul 12, 5:24 pm, "Andre Engels" <andreeng...@gmail.com> wrote:
> def textonly(url):
>     [...]
Andre, I think that unfortunately your solution will not ignore inlined
scripting, inlined styling, etc.
On the other hand, I don't think there are many solutions available,
other than the Lynx approach somebody has already suggested.

bests,
../alex
--
..w( the_mindstorm )p.
Jul 12 '07 #6
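
The concern raised above, that plain tag stripping leaves inline script and style contents in the output, can be illustrated with a minimal sketch built on the standard library's html.parser (Python 3). The VisibleText class and visible_text() function are hypothetical names introduced here; this skips script and style contents only and makes no attempt to detect sidebars.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only while we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so the result is easy to read.
        return " ".join("".join(self._chunks).split())


def visible_text(html):
    """Return the text of `html` without script or style contents."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because the parser also decodes character references, an input such as "a &amp; b" comes out as "a & b".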
On Jul 12, 10:22 pm, Jon Rosebaugh <j...@turnthepage.org> wrote:
> You may find BeautifulSoup or templatemaker to be of assistance: [...]

Thanks all for your suggestions. I will try the Lynx solution first.

Cheers,
gk

Jul 12 '07 #7
kublai wrote:
> For a project, I need to develop a corpus of online news stories. [...]

Super-simplistic:
>>> import lxml.etree as et
>>> parser = et.HTMLParser()
>>> tree = et.parse("http://the/page.html", parser)
>>> print(tree.xpath("string(/html/body)"))
http://codespeak.net/lxml/

You may want to use the incredibly versatile "lxml.html.clean" module first to
remove any annoying content. It's not released yet but available in a branch:

http://codespeak.net/svn/lxml/branch/html/

Stefan
Jul 12 '07 #8
On Jul 13, 2:19 am, Stefan Behnel <stefan.behnel-n05...@web.de> wrote:
> Super-simplistic: [...]
>
> You may want to use the incredibly versatile "lxml.html.clean" module first to
> remove any annoying content. [...]

Hi, Stefan,
This looks very interesting. I will look into this first thing
tonight. Gotta hit some golf bugs, I mean, balls first. It's a
beautiful afternoon here in Edmonton.
Cheers,
gk

Jul 12 '07 #9
On Jul 12, 4:42 am, kublai <restyc...@gmail.com> wrote:
> For a project, I need to develop a corpus of online news stories. [...]

One of the examples provided with pyparsing is an HTML stripper - view
it online at http://pyparsing.wikispaces.com/spac...tmlStripper.py.

-- Paul

Jul 13 '07 #10
On Jul 13, 5:44 pm, Paul McGuire <pt...@austin.rr.com> wrote:
> One of the examples provided with pyparsing is an HTML stripper - view
> it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.
>
> -- Paul

Stripping tags is indeed one strategy that came to mind. I'm wondering
how much information (for example, paragraphing) would be lost, and
whether that loss would be acceptable for the project. I looked at
pyparsing and I see that it's got a lot of text processing
capabilities that I can use along the way. I sure will try it. Thanks
for the post.

Best,
gk

Jul 13 '07 #11
To maintain paragraphs, replace any p or br tags with your favorite
operating system's crlf.

On Jul 13, 8:57 am, kublai <restyc...@gmail.com> wrote:
> Stripping tags is indeed one strategy that came to mind. I'm wondering
> how much information (for example, paragraphing) would be lost [...]


Jul 13 '07 #12
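
The suggestion above, turning p and br tags into line breaks before stripping the remaining tags, might be sketched as follows (Python 3). The names BREAK, TAG and text_with_paragraphs are introduced here for illustration. The sketch emits "\n"; writing the result to a file opened in text mode converts that to the platform's own line ending, such as CRLF on Windows.

```python
import re

# <p>, </p>, <p class="...">, <br>, <br/> and <br />, but not <pre> or <param>.
BREAK = re.compile(r"<\s*/?\s*(?:p|br)\b[^<>]*>", re.IGNORECASE)
# Any other tag.
TAG = re.compile(r"<[^<>]*>")


def text_with_paragraphs(html):
    """Strip tags from `html` but keep paragraph and line breaks."""
    # Turn paragraph and line-break tags into newlines before stripping the rest.
    text = BREAK.sub("\n", html)
    text = TAG.sub("", text)
    # Trim each line and drop the empty ones left behind by adjacent tags.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Each paragraph or br-separated line ends up on its own line, which keeps some of the paragraphing that would otherwise be lost.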
Miki <mi*********@gmail.com> wrote:
> (You can find lynx at http://lynx.browser.org/)

not exactly -

The current version of lynx is 2.8.6

It's available at
http://lynx.isc.org/lynx2.8.6/
2.8.7 Development & patches:
http://lynx.isc.org/current/index.html

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net
Jul 22 '07 #13
