Web Spider and Scraper

UJ
Does anybody have a code to 'scrape' the text off of a web page? I've got to
write a program that will go through some web pages and get the text to put
in a database.

TIA - Jeff.
Jan 21 '07 #1
On Sun, 21 Jan 2007 19:14:45 +0100, UJ <fr**@nowhere.com> wrote:
Does anybody have a code to 'scrape' the text off of a web page? I've got to
write a program that will go through some web pages and get the text to put
in a database.

TIA - Jeff.

No, but it shouldn't be too complicated.

Simply open an input stream and fetch the URL, then use a regex to
extract the text you need.
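
A minimal sketch of that fetch-and-regex approach, in Python using only the standard library. The function names (extract_text, fetch_text) and the sample markup are made up for illustration, and the patterns are deliberately simple: they cope with tidy pages, not with every kind of malformed HTML.

```python
import html
import re
from urllib.request import urlopen

# Remove <script> and <style> elements together with their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                           re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")
_WHITESPACE = re.compile(r"\s+")


def extract_text(markup: str) -> str:
    """Strip tags from an HTML string and return its visible text."""
    markup = _SCRIPT_STYLE.sub(" ", markup)
    markup = _COMMENT.sub(" ", markup)
    markup = _TAG.sub(" ", markup)
    text = html.unescape(markup)  # turn &amp; and friends back into characters
    return _WHITESPACE.sub(" ", text).strip()


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        body = response.read().decode(charset, errors="replace")
    return extract_text(body)


if __name__ == "__main__":
    sample = ("<html><head><style>p{color:red}</style></head><body>"
              "<h1>Prices</h1><p>Tea &amp; coffee: <b>3</b> EUR</p>"
              "</body></html>")
    print(extract_text(sample))
```

Keeping extract_text separate from the download step means the regex part can be tried on saved pages before it is pointed at a live site.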

--
- Stefan Z Camilleri
- www.szc001.com
Jan 21 '07 #2
Also, if you have the cash, take a look at webzinc.com.

"UJ" <fr**@nowhere.com> wrote in message
news:On**************@TK2MSFTNGP04.phx.gbl...
Does anybody have a code to 'scrape' the text off of a web page? I've got
to write a program that will go through some web pages and get the text to
put in a database.

TIA - Jeff.

Jan 21 '07 #3
Before you start spending money on solutions, I'd suggest you download and
try out Simon Mourier's HtmlAgilityPack. Among other nice things, it turns
any downloaded web page into an XPath-compliant DOM ("HtmlDocument"), much
like the .NET XmlDocument object, so it is easy to pull everything you need
from the page with XPath queries.
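
To show the idea of "parse the page into a DOM, then query it with paths", here is a rough sketch in Python. It does not use HtmlAgilityPack or its API; it swaps in the standard library's html.parser and xml.etree.ElementTree, and ElementTree understands only a subset of XPath. The TreeBuilder class, parse_html function, and sample table are invented for the example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Build an ElementTree from HTML so it can be queried with paths."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # synthetic root element
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        element = ET.SubElement(self._stack[-1], tag,
                                {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tag such as <br/>: add it, but do not descend into it.
        ET.SubElement(self._stack[-1], tag, {k: v or "" for k, v in attrs})

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].tag == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        current = self._stack[-1]
        if len(current):  # text after a child belongs in that child's tail
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def parse_html(markup):
    """Parse an HTML string and return the root of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = """<html><body><table id="prices">
      <tr><td class="name">Tea</td><td class="price">3</td></tr>
      <tr><td class="name">Coffee</td><td class="price">4</td></tr>
    </table></body></html>"""
    root = parse_html(page)
    query = ".//table[@id='prices']//td[@class='name']"
    print([td.text for td in root.findall(query)])
```

The point is the same as with HtmlDocument: once the page is a tree, each piece of data is one path expression away, which tends to survive layout changes better than a hand-written regex.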
Peter

--
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
Short urls & more: http://ittyurl.net


"UJ" wrote:
Does anybody have a code to 'scrape' the text off of a web page? I've got to
write a program that will go through some web pages and get the text to put
in a database.

TIA - Jeff.
Jan 21 '07 #4
For simple websites, regex is a good solution. But for complicated
websites, hand-written extraction code is a nightmare to maintain,
especially if you need to click through several pages before you arrive
at the page that contains the data you want to extract. There are
several tools for this: Kapow is good but very expensive; iMacros can
do the same, is easier to use, and costs much less. Both Kapow and
iMacros offer good support. We use iMacros and are very happy with it.
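
For a sense of what the hand-written "click through" looks like, here is a small Python sketch using only the standard library. It is not how Kapow or iMacros work; follow_links, fetch, and the example.test addresses are hypothetical. It picks each link by its visible text, which is exactly the kind of detail that breaks when a site is redesigned.

```python
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Very simple anchor pattern: assumes double-quoted href attributes.
_LINK = re.compile(r'<a\b[^>]*\bhref="([^"]+)"[^>]*>(.*?)</a>',
                   re.IGNORECASE | re.DOTALL)


def fetch(url):
    """Download a page and return it as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(start_url, link_texts, fetch_page=fetch):
    """Follow one link per page, chosen by its visible text.

    link_texts lists the link labels to "click", in order.
    Returns the URL and contents of the final page.
    """
    url, page = start_url, fetch_page(start_url)
    for wanted in link_texts:
        for href, label in _LINK.findall(page):
            # Drop markup inside the link, e.g. <b>Tea</b> becomes Tea.
            if re.sub(r"<[^>]+>", "", label).strip() == wanted:
                url = urljoin(url, href)  # resolve relative links
                page = fetch_page(url)
                break
        else:
            # The inner loop finished without finding the label.
            raise LookupError(f"no link labelled {wanted!r} on {url}")
    return url, page


if __name__ == "__main__":
    # A fake three-page site, so the sketch runs without network access.
    site = {
        "http://example.test/": '<a href="/catalog">Catalog</a>',
        "http://example.test/catalog": '<a href="items/tea.html"><b>Tea</b></a>',
        "http://example.test/items/tea.html": "<p>Tea: 3 EUR</p>",
    }
    print(follow_links("http://example.test/", ["Catalog", "Tea"],
                       fetch_page=site.__getitem__))
```

Passing the page fetcher in as a parameter lets the navigation logic be tested against saved pages, which helps a little with the maintenance problem.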

Jon

Jan 22 '07 #5
UJ,

MSHTML is made for this. It can give you the content of any part of a
page, down to the individual element, and it follows the HTML DOM model
completely.

Cor
Jan 22 '07 #6
You can also try SWExplorerAutomation (SWEA, http://webiussoft.com).
SWEA supports frames, popups, HTML and Windows dialogs (alerts), and AJAX.
With SWEA you can visually record a script and generate C#/VB.NET code.
UJ wrote:
Does anybody have a code to 'scrape' the text off of a web page? I've got to
write a program that will go through some web pages and get the text to put
in a database.

TIA - Jeff.
Jan 22 '07 #7
