
Access to databases of other web sites

I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the Internet. I think
this is possible because it is similar to what price comparison sites
and web robots do.

I am studying Python these days because I think it is a good language
for the work. Actually, I am a novice at Python.

I welcome any information about this problem. Thanks in advance.
Jul 18 '05 #1
ti****@hanmail.net (Jenny) writes:
> I am doing research about the relationship between sales rates and
> discounted prices or recommendation frequency. To do this, I need to
> access the databases of commercial web sites via the Internet. I think
> this is possible because it is similar to what price comparison sites
> and web robots do.

IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).

> I am studying Python these days because I think it is a good language
> for the work. [...]

I think so too.

> I welcome any information about this problem. Thanks in advance.


In the standard library, you'll want to look at these modules: httplib
(low level HTTP -- you probably don't want to use this), urllib2
(opens URLs as if they were files, handles redirections, proxies
etc. for you) and HTMLParser. The standard library also includes
sgmllib & htmllib, but you'll probably want to use HTMLParser instead
if you want that kind of event-driven parsing at all. Regular
expressions (re module) can also come in handy.
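
For instance, a rough sketch using just those modules (the URL is
illustrative, and the parser only collects link targets):

# Fetch a page with urllib2 and pull out link hrefs with HTMLParser.
import urllib2
from HTMLParser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []
    def handle_starttag(self, tag, attrs):
        # Collect the href attribute of every anchor tag.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

response = urllib2.urlopen('http://www.example.com/')
parser = LinkExtractor()
parser.feed(response.read())
print parser.links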

Personally, I've decided that I prefer the DOM style of parsing for
anything complicated -- it's just less work than the event-driven
style (though I don't much like the DOM API). PyXML has an HTML DOM
implementation called 4DOM. Use that together with mxTidy or
uTidylib: they will clean up the horrid HTML you'll find on the web to
the point where 4DOM can make sense of it. Another option is to use
mxTidy/uTidylib to output XHTML, which allows you to use any XML DOM
implementation -- eg. pxdom, minidom, libxml...
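
A minimal sketch of that second route, assuming uTidylib's parseString
interface (the input file name is illustrative):

# Clean the HTML into XHTML first, then hand it to any XML DOM.
import tidy
from xml.dom import minidom

raw_html = open('page.html').read()
xhtml = str(tidy.parseString(raw_html,
                             output_xhtml=1,      # emit XHTML, not HTML
                             numeric_entities=1,  # named entities would choke expat
                             tidy_mark=0))
doc = minidom.parseString(xhtml)
# Ordinary DOM calls now work, e.g. collect all text-only table cells:
cells = [td.firstChild.data for td in doc.getElementsByTagName('td')
         if td.firstChild and td.firstChild.nodeType == td.TEXT_NODE]
print cells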

You might find my modules useful too. ClientCookie has an interface
just like urllib2 (and uses it to do its work), but handles cookies
and some other stuff too. ClientForm makes it easier to work with
HTML forms. ClientTable is currently a heap of junk, don't use it ;-)
I've just rewritten ClientForm on top of the DOM, which lets you
switch back and forth between the two APIs (and also lets you handle
JavaScript, rather badly ATM) -- coming RSN...

http://wwwsearch.sourceforge.net/
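
A rough sketch of how the two fit together (the site and the form
field name here are made up; see the URL above for the real docs):

import ClientCookie
import ClientForm

# ClientCookie.urlopen works like urllib2.urlopen but keeps cookies.
response = ClientCookie.urlopen('http://www.example.com/search')
forms = ClientForm.ParseResponse(response)
form = forms[0]
form['q'] = 'widgets'  # fill in a (hypothetical) text field
# form.click() returns a urllib2-style request for the submission.
result = ClientCookie.urlopen(form.click())
print result.read()
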
The other, completely different, way of web scraping is to use the
"automation" capabilities of the various big web browsers: Microsoft
Internet Explorer, KDE's Konqueror and Mozilla are all scriptable from
Python. You need the Python for Windows extensions, PyKDE or PyXPCOM
respectively to control those browsers. Advantages: easy handling of
JavaScript and other assorted nonsense, and they're generally
reasonably well-tested and stable pieces of software (not to mention
de-facto standards). Disadvantages: poor portability in some cases,
and they're rather big, complicated, closed applications that are hard
to modify (compared to the pure Python approach) and to distribute
(which last, I guess, isn't a problem for you, since you'll be the
only one using your software). Other problems: COM (for MSIE) is a
bit of a headache for newbies, PyXPCOM last time I looked seemed a
pain to install (Brendan Eich mentioned in a newsgroup post that that
has been changing recently, though), and PyKDE might not be that well
tested (it's a very big wrapper!).
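
For the MSIE route, a minimal COM sketch might look like this (the URL
is illustrative):

import time
import win32com.client

ie = win32com.client.Dispatch('InternetExplorer.Application')
ie.Visible = 0                        # run without showing a window
ie.Navigate('http://www.example.com/')
while ie.Busy or ie.ReadyState != 4:  # 4 == READYSTATE_COMPLETE
    time.sleep(0.5)
# The browser has run any JavaScript for us; read the live DOM.
print ie.Document.body.innerHTML
ie.Quit()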

One other bunch of software worthy of mention: you can use Jython to
access various Java libraries. HTTPClient and httpunit look like they
might be useful. In particular, the latter has some JavaScript
support.
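
A minimal Jython sketch, using only the core Java libraries (the
dedicated packages above would be similar in spirit):

from java.net import URL
from java.io import BufferedReader, InputStreamReader

conn = URL('http://www.example.com/').openConnection()
reader = BufferedReader(InputStreamReader(conn.getInputStream()))
lines = []
line = reader.readLine()
while line is not None:   # readLine() returns None at end of stream
    lines.append(line)
    line = reader.readLine()
reader.close()
print '\n'.join(lines)
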
John
Jul 18 '05 #2
jj*@pobox.com (John J. Lee) writes:
> ti****@hanmail.net (Jenny) writes:
> > I am doing research about the relationship between sales rates and
> > discounted prices or recommendation frequency. To do this, I need to
> > access the databases of commercial web sites via the Internet. [...]

Forgot to say: if you don't already know, Google Groups can be worth
its weight in round tuits. Try some searches there, in
comp.lang.python, on the stuff I mentioned.
John
Jul 18 '05 #3
| IIUYC, what you're contemplating is called "web scraping"
| ....

John ....

I did a bit of web scraping over the past weekend
for a friend who is interested in Lotto numbers ....

The Lotto numbers were readily available on the web
and presented as well-formed and readable HTML tables ....

The primary problem I found up front was being able
to parse and transform this data into something
that Python, or any other language, might be able
to cope with for subsequent analysis ....

Since the number of records that I was dealing with
in this case was relatively small, only a couple of thousand,
I could manage the initial data transformations
using my genetically encoded EyeBall parser,
a text editor, and a couple of one-off Python scripts ....

The first step in each case for the source files
was using HTML Tidy to ...

"clean up the horrid HTML you'll find on the web "

I'd like to emphasize for the benefit of the original poster
that the initial data parsing will probably entail a fair amount
of non-trivial work and that the subsequent data analysis
and reporting will seem almost trivial by comparison ....
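
A sketch of that kind of one-off table extraction, assuming HTML Tidy
has already cleaned the input (the file name is illustrative):

# Pull each table row out as a list of cell strings.
from HTMLParser import HTMLParser

class TableParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.rows, self.row, self.in_cell = [], [], 0
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag in ('td', 'th'):
            self.in_cell = 1
            self.cell = ''
    def handle_endtag(self, tag):
        if tag == 'tr' and self.row:
            self.rows.append(self.row)
        elif tag in ('td', 'th'):
            self.in_cell = 0
            self.row.append(self.cell.strip())
    def handle_data(self, data):
        if self.in_cell:
            self.cell += data

parser = TableParser()
parser.feed(open('lotto.html').read())
for row in parser.rows:
    print row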

Thanks for posting the info regarding different approaches,
as I think it will be useful for me when I get around
to replacing my EyeBall parser with something more effective ....

--
Cousin Stanley
Human Being
Phoenix, Arizona

Jul 18 '05 #4
In article <87************@pobox.com>, John J. Lee <jj*@pobox.com> wrote:
> ti****@hanmail.net (Jenny) writes:
> > I am doing research about the relationship between sales rates and
> > discounted prices or recommendation frequency. To do this, I need to
> > access the databases of commercial web sites via the Internet.
>
> IIUYC, what you're contemplating is called "web scraping" [...]

I'm ... reserved about the prospects for the proposed research. The
commercial sites you want to study are, in my experience, some of the
most difficult to "scrape". Complementing that difficulty is the
poverty of inference I anticipate you'll be able to ground on what you
find there; their commerce has a lot more noise than signal, as I see
it.

'Twould be great, though, for you to uncover something real. Good luck.

Jul 18 '05 #5
cl****@lairds.com (Cameron Laird) writes:
> [...] I'm ... reserved about the prospects for the proposed research.
> The commercial sites you want to study are, in my experience, some of
> the most difficult to "scrape".

Which (ATM, anyway) is a good reason for doing it with browser automation.

> Complementing that difficulty is the poverty of inference I anticipate
> you'll be able to ground on what you find there; their commerce has a
> lot more noise than signal, as I see it.

What do you mean by 'their commerce has more noise than signal'?

> 'Twould be great, though, for you to uncover something real. Good luck.

What I was wondering was where the sales data are going to come from.

John
Jul 18 '05 #6
In article <87************@pobox.com>, John J. Lee <jj*@pobox.com> wrote:
> What I was wondering was where the sales data are going to come from.

That's a typical part. As I understand Jenny, she's going to look at,
say, eBay, and correlate "sales" with "price" and "marketing" variables.

Jul 18 '05 #7
cl****@lairds.com (Cameron Laird) writes:
> That's a typical part. As I understand Jenny, she's going to look at,
> say, eBay, and correlate "sales" with "price" and "marketing" variables.

Oh, eBay, I see. I was thinking about non-auction sites. On auction
sites, some of the sales data are public, I suppose.

John
Jul 18 '05 #8