web page text extractor

kublai

Hello,

For a project, I need to develop a corpus of online news stories. I'm
looking for an application that, given the URL of a web page, "copies"
the rendered text of the web page (not the HTML source), opens a
text editor (Notepad), and displays the copied text for the user to
examine and save into a text file. Graphics and sidebars should be
ignored. The examples I have come across are much too complex for me
to customize for this simple job. Can anyone point me in the right
direction?

Thanks,
gk

Jul 12 '07 #1

Miki

Hello gk,

Going simple :)

from os import system
from sys import argv

OUTFILE = "geturl.txt"
# Redirect lynx's rendered-text dump into OUTFILE, then open it in Notepad.
system("lynx -dump %s > %s" % (argv[1], OUTFILE))
system("start notepad %s" % OUTFILE)

(You can find lynx at http://lynx.browser.org/)

Note that removing sidebars is a very difficult problem.
Search for "wrapper induction" to see some work on the subject.

HTH,
--
Miki <mi*********@gmail.com>
http://pythonwise.blogspot.com

Jul 12 '07 #2
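
The lynx approach in post #2 can also be written without shell redirection. What follows is only a minimal Python 3 sketch, assuming lynx is installed and on the PATH; the function names and the default file name are illustrative, and -nolist (which drops the numbered link list lynx appends to a dump) is an optional extra that the original post did not use.

```python
import subprocess


def lynx_command(url):
    """Build the argument list for a text-only dump of `url`."""
    # -dump prints the rendered page to stdout; -nolist omits the
    # numbered list of links that lynx otherwise appends.
    return ["lynx", "-dump", "-nolist", url]


def dump_to_file(url, outfile="geturl.txt"):
    """Run lynx and save the rendered text to `outfile`.

    Assumes lynx is on the PATH; raises CalledProcessError if it fails.
    """
    result = subprocess.run(lynx_command(url), capture_output=True,
                            text=True, check=True)
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(result.stdout)
    return outfile


def open_in_notepad(path):
    """Open the saved file in Notepad (Windows only)."""
    subprocess.Popen(["notepad", path])
```

Calling open_in_notepad(dump_to_file(url)) would reproduce the two system() calls from the post. Passing the URL as a list element rather than interpolating it into a shell string avoids trouble with characters such as & in query strings.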

Jon Rosebaugh

You may find BeautifulSoup or templatemaker to be of assistance:

http://www.crummy.com/software/BeautifulSoup/
http://www.holovaty.com/blog/archive/2007/07/06/0128

Jul 12 '07 #3

Andre Engels

def textonly(url):
    # Get the HTML source at url and strip every tag from it
    f = urllib2.urlopen(url)
    text = f.read()
    r = re.compile('\<[^\<\>]*\>')
    newtext = r.sub('',text)
    # Repeat until no more tags are found
    while newtext != text:
        text = newtext
        newtext = r.sub('',text)
    return text

--
Andre Engels, an*********@gmail.com
ICQ: 6260644 -- Skype: a_engels

Jul 12 '07 #4

Andre Engels

I forgot to include the imports. The complete version:

# Python 2; in Python 3, urllib2 became urllib.request
import urllib2, re

def textonly(url):
    # Get the HTML source at url and strip every tag from it
    f = urllib2.urlopen(url)
    text = f.read()
    r = re.compile('\<[^\<\>]*\>')
    newtext = r.sub('',text)
    # Repeat until no more tags are found
    while newtext != text:
        text = newtext
        newtext = r.sub('',text)
    return text

--
Andre Engels, an*********@gmail.com
ICQ: 6260644 -- Skype: a_engels

Jul 12 '07 #5

Alex Popescu

Andre, I think that unfortunately your solution will not ignore inline
scripting, inline styling, etc.
On the other hand, I don't think there are many solutions available
other than the Lynx approach somebody has already suggested.

bests,
../alex
--
..w( the_mindstorm )p.

Jul 12 '07 #6
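
To illustrate the point in post #6 about inline scripts and styles: a tag-stripping regex leaves the contents of script and style elements in the output. The following is only a sketch of one way around that, using Python 3's standard html.parser module rather than anything proposed in the thread; the class and function names are made up for the example, and it makes no attempt to drop sidebars.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    """Return the text of `html` with runs of whitespace collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = ("<html><head><style>p {color: red}</style>"
          "<script>var x = '<b>';</script></head>"
          "<body><p>Hello <b>world</b></p></body></html>")
print(visible_text(sample))  # Hello world
```

Collapsing all whitespace also discards paragraph breaks, so this version suits a quick look at the text more than a corpus where layout matters.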

kublai

Thanks all for your suggestions. I will try the Lynx solution first.

Cheers,
gk

Jul 12 '07 #7

Stefan Behnel

Super-simplistic:

>>> import lxml.etree as et
>>> parser = et.HTMLParser()
>>> tree = et.parse("http://the/page.html", parser)
>>> print tree.xpath("string(/html/body)")

http://codespeak.net/lxml/

You may want to use the incredibly versatile "lxml.html.clean" module first to
remove any annoying content. It's not released yet but available in a branch:

http://codespeak.net/svn/lxml/branch/html/

Stefan

Jul 12 '07 #8

kublai

Hi, Stefan,
This looks very interesting. I will look into this first thing
tonight. Gotta hit some golf bugs, I mean, balls first. It's a
beautiful afternoon here in Edmonton.
Cheers,
gk

Jul 12 '07 #9

Paul McGuire

One of the examples provided with pyparsing is an HTML stripper - view
it online at http://pyparsing.wikispaces.com/space/showimage/htmlStripper.py.

-- Paul

Jul 13 '07 #10

kublai

Stripping tags is indeed one strategy that came to mind. I'm wondering
how much information (for example, paragraphing) would be lost, and
whether that loss would be acceptable for the project. I looked at
pyparsing and I see that it's got a lot of text processing
capabilities that I can use along the way. I sure will try it. Thanks
for the post.

Best,
gk

Jul 13 '07 #11

rdahlstrom

To maintain paragraphs, replace any <p> or <br> tags with your favorite
operating system's line ending (CRLF on Windows).

Jul 13 '07 #12
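
The suggestion in post #12 can be sketched in a few lines of Python 3. This is only an illustration under simple assumptions: the function name is invented, only p and br tags are treated as line breaks, and a regex like this will still misfire on malformed markup or on scripts that contain angle brackets.

```python
import re

# Opening or closing <p ...> and any form of <br>; \b keeps <pre> out.
BREAK_TAGS = re.compile(r"<\s*/?\s*(?:p|br)\b[^<>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^<>]*>")


def strip_keep_paragraphs(html, newline="\n"):
    """Strip tags, turning <p> and <br> into line breaks first."""
    text = BREAK_TAGS.sub(newline, html)   # paragraph tags -> line breaks
    text = ANY_TAG.sub("", text)           # drop every remaining tag
    lines = [line.strip() for line in text.split(newline)]
    return newline.join(line for line in lines if line)  # skip empty lines


sample = "<div><p>First paragraph.</p><p>Second<br>line.</p></div>"
print(strip_keep_paragraphs(sample))
# First paragraph.
# Second
# line.
```

When the result is written to a file opened in text mode, Python translates "\n" to the platform's line ending on its own, so passing newline="\r\n" is only needed if the string itself must carry CRLF.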

Thomas Dickey

Miki <mi*********@gmail.com> wrote:

(You can find lynx at http://lynx.browser.org/)

not exactly -

The current version of lynx is 2.8.6

It's available at
http://lynx.isc.org/lynx2.8.6/
2.8.7 Development & patches:
http://lynx.isc.org/current/index.html

--
Thomas E. Dickey
http://invisible-island.net
ftp://invisible-island.net

Jul 22 '07 #13
