Need a starting point for getting HTML info from the web

Hey there,

I have a small app that needs to get information from a few tables
on different websites. I have looked at urllib and httplib. The sites
I need data from mostly present it in HTML tables, which I think
should make it easier. Can anyone suggest a good starting point for
learning how to do this, or point me to a good how-to?
Thanks,
sk

Oct 31 '05 #1
4 Replies


ne*****@xit.net writes:
> i have a small app that i am going to need to get information from a
> few tables on different websites. i have looked at urllib and httplib.
> the sites i need to get data from mostly have this data in tables. So
> that, i think would make it easier. Anyone suggest a good starting
> point for me to find out how to do this, or know of a link to a good
> how-to?


I don't have a link to a how-to, but you're halfway there. urllib (and
urllib2) will get the HTML text from the websites. How you pull data
out of it depends on the nature of the HTML. If it's well-structured
XHTML, you can use your favorite XML library. If it's well-structured
HTML, you can try htmllib, but it's pretty primitive. If it's not
well-structured, you can use BeautifulSoup. I've used it to pull data
from tables. The problem with any of these is that your code really
depends on the structure, or lack thereof, of the HTML you're
scraping. If the site changes it, your code breaks.
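
As a rough illustration of the fetch-then-parse approach described above
(this is not code from the original post), here is a minimal sketch using
only the Python 3 standard library. It assumes urllib.request as the
Python 3 home of the urllib/urllib2 functionality and uses html.parser in
place of htmllib, which Python 3 no longer ships. The names
TableExtractor, extract_rows and fetch_rows are made up for this example,
and it makes no attempt to handle nested tables.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._end_row()        # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()       # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag == "tr":
            self._end_row()

    def _end_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url):
    """Download a page and return its table rows (hypothetical helper)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_rows(response.read().decode(charset, errors="replace"))
```

Calling extract_rows on a page's HTML gives one list per table row, so the
site-specific part of the code shrinks to picking the right rows and
columns. That part is still what breaks when a site changes its layout.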

<mike
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Oct 31 '05 #2

Yeah, I know I am going to have to write a bunch of code, because the
values I want come from several different sites. Oh well, I just
wanted to know the easiest way to get started. I will check into
BeautifulSoup; I think I have heard it mentioned before.
Thanks,
shawn

Oct 31 '05 #3

<ne*****@xit.net> wrote in message
news:11*********************@g47g2000cwa.googlegroups.com...
> hey there,
>
> i have a small app that i am going to need to get information from a
> few tables on different websites. i have looked at urllib and httplib.
> the sites i need to get data from mostly have this data in tables. So
> that, i think would make it easier. Anyone suggest a good starting
> point for me to find out how to do this, or know of a link to a good
> how-to?
> thanks,
> sk

pyparsing comes with a simple HTML scraper example for extracting the NIST
NTP servers from an HTML table. pyparsing is also fairly tolerant of
"unclean" HTML. Download pyparsing at http://pyparsing.sourceforge.net.
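
To give a feel for the approach, here is a minimal sketch of pulling
table cells out of HTML with pyparsing. It is not the NIST example that
ships with pyparsing, and it assumes pyparsing is installed. The sample
HTML, the host names in it, and the names cell and extract_cells are made
up for illustration.

```python
from pyparsing import SkipTo, makeHTMLTags

# Made-up sample input; a real page would be fetched with urllib first.
SAMPLE_HTML = """
<table>
  <tr><td>time-a.example.org</td><td>192.0.2.1</td></tr>
  <tr><TD class="addr">time-b.example.org</TD><td>192.0.2.2</td></tr>
</table>
"""

# makeHTMLTags builds expressions for the opening and closing tag; the
# opening-tag expression accepts attributes and any letter case.
td_open, td_close = makeHTMLTags("td")

# A cell is an opening tag, everything up to the closing tag, then the
# closing tag. The skipped text is stored under the name "cell_text".
cell = td_open + SkipTo(td_close)("cell_text") + td_close


def extract_cells(html):
    """Return the stripped text of every <td> cell found in html."""
    return [
        tokens.cell_text.strip()
        for tokens, _start, _end in cell.scanString(html)
    ]
```

scanString skips over anything that does not match the cell expression,
which is why surrounding markup that is not well-formed tends not to get
in the way.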

-- Paul
Oct 31 '05 #4

You can easily do it with SW Explorer Automation
(http://home.comcast.net/~furmana/SWIEAutomation.htm).
The program creates an automation API for any Web application that
uses HTML and DHTML and works with Microsoft Internet Explorer. The Web
application becomes programmatically accessible from any .NET language.
The tool includes a Visual Table Data Extractor, which lets you define
the table structure visually. The table then becomes accessible from
code as a DataTable. You can develop an extraction script in hours
with the tool.

Oct 31 '05 #5
