need start point for getting html info from web

hey there,

i have a small app that is going to need to get information from a
few tables on different websites. i have looked at urllib and httplib.
the sites i need to get data from mostly have this data in tables, so
that, i think, should make it easier. Can anyone suggest a good starting
point for me to find out how to do this, or know of a link to a good
how-to?
thanks,
sk

Oct 31 '05 #1
ne*****@xit.net writes:
> i have a small app that is going to need to get information from a
> few tables on different websites. i have looked at urllib and httplib.
> the sites i need to get data from mostly have this data in tables, so
> that, i think, should make it easier. Can anyone suggest a good starting
> point for me to find out how to do this, or know of a link to a good
> how-to?


Don't have a link to a howto. But you're halfway there. urllib (and
urllib2) will get HTML text from the websites. Pulling data from it
sort of depends on the nature of the HTML. If it's well-structured
XHTML, you can use your favorite XML library. If it's well-structured
HTML, you can try htmllib, but it's pretty primitive. If it's not
well-structured, you can use BeautifulSoup. I've used it to pull data
from tables. The problem with any of this is that your code really
depends on the structure - or lack thereof - of the HTML you're
scraping. If they change it, your code breaks.

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 31 '05 #2
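The fetch-then-parse approach from the reply above can be sketched with the standard library alone. This is only a minimal illustration, not the poster's code: urllib2 and htmllib were Python 2 modules, so the sketch assumes Python 3 and uses their standard-library counterparts, urllib.request and html.parser. The names TableExtractor, extract_rows, fetch_html and the sample table are hypothetical. Nested tables are not handled, and for badly broken markup a tolerant parser such as BeautifulSoup is likely the better choice.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()       # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()      # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()


def extract_rows(html):
    """Return all table rows found in an HTML string as lists of strings."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_html(url, encoding="utf-8"):
    """Download a page; the encoding default is an assumption."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(encoding, errors="replace")


# Hypothetical sample data, used here instead of a live site.
SAMPLE = """
<table>
  <tr><th>Server</th><th>Location</th></tr>
  <tr><td>ntp-a.example</td><td>Site A</td></tr>
  <tr><td>ntp-b.example</td><td>Site B</td></tr>
</table>
"""

rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

For a real page, extract_rows(fetch_html(url)) would replace the sample string. As the reply warns, code like this is tied to the page layout, so it breaks if the site restructures its tables.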
yeah, i know i am going to have to write a bunch of stuff because the
values i want to get come from several different sites. ah-well, just
wanting to know the easiest way to learn how to get started. i will
check into beautiful soup, i think i have heard it referred to before.
thanks
shawn

Oct 31 '05 #3
<ne*****@xit.net> wrote in message
news:11*********************@g47g2000cwa.googlegroups.com...
> hey there,
>
> i have a small app that is going to need to get information from a
> few tables on different websites. i have looked at urllib and httplib.
> the sites i need to get data from mostly have this data in tables, so
> that, i think, should make it easier. Can anyone suggest a good starting
> point for me to find out how to do this, or know of a link to a good
> how-to?
> thanks,
> sk

pyparsing comes with a simple HTML scraper example for extracting the NIST
NTP servers from an HTML table. pyparsing is also fairly tolerant of
"unclean" HTML. Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul
Oct 31 '05 #4
You can easily do it with SW Explorer Automation
(http://home.comcast.net/~furmana/SWIEAutomation.htm).
The program creates an automation API for any Web application which
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.
The tool has a Visual Table Data Extractor, which lets you visually
define the table structure. The table then becomes accessible from code
as a DataTable object. You can develop the extraction script in hours
with the tool.

Oct 31 '05 #5
