Bytes | Software Development & Data Engineering Community

Access to databases of other web sites

I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the internet. I think
this is possible because it is similar to the work of price-comparison
sites and web robots.

I am studying Python these days because I think it is a good language
for the work. Actually, I am a novice at Python.

I welcome any information about this problem. Thanks in advance.
Jul 18 '05 #1
ti****@hanmail.net (Jenny) writes:
I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the internet. I think
this is possible because it is similar to the work of price-comparison
sites and web robots.
IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).

I am studying Python these days because I think it is a good language
for the work. [...]

I think so too.

I welcome any information about this problem. Thanks in advance.


In the standard library, you'll want to look at these modules: httplib
(low level HTTP -- you probably don't want to use this), urllib2
(opens URLs as if they were files, handles redirections, proxies
etc. for you) and HTMLParser. The standard library also includes
sgmllib & htmllib, but you'll probably want to use HTMLParser instead
if you want that kind of event-driven parsing at all. Regular
expressions (re module) can also come in handy.
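To make the event-driven style concrete, here's a minimal sketch. (In today's Python, HTMLParser lives at html.parser and urllib2's role is played by urllib.request; the markup and the PriceExtractor class are invented for illustration, not taken from any real site.)

```python
# Event-driven HTML parsing: the parser calls our handlers as it
# encounters tags and text, and we keep just what we want.
from html.parser import HTMLParser

class PriceExtractor(HTMLParser):
    """Collect the text of every <td> cell -- a stand-in for
    pulling prices out of a product table."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# In real use the page would come from urllib.request.urlopen(url).
page = "<table><tr><td>Widget</td><td>$9.99</td></tr></table>"
parser = PriceExtractor()
parser.feed(page)
print(parser.cells)   # -> ['Widget', '$9.99']
```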

Personally, I've decided that I prefer the DOM style of parsing for
anything complicated -- it's just less work than the event-driven
style (though I don't much like the DOM API). PyXML has an HTML DOM
implementation called 4DOM. Use that together with mxTidy or
uTidylib: they will clean up the horrid HTML you'll find on the web to
the point where 4DOM can make sense of it. Another option is to use
mxTidy/uTidylib to output XHTML, which allows you to use any XML DOM
implementation -- eg. pxdom, minidom, libxml...
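To illustrate the DOM style: once the HTML has been tidied into well-formed XHTML, any XML DOM implementation can walk it. This sketch uses the standard library's minidom on a hand-written fragment (in real use you'd feed it mxTidy/uTidylib output; the table contents are made up):

```python
# DOM-style parsing: build the whole tree first, then query it,
# instead of reacting to events as the parser goes.
from xml.dom.minidom import parseString

xhtml = """<html><body>
<table>
  <tr><td>Item</td><td>Price</td></tr>
  <tr><td>Book</td><td>12.50</td></tr>
</table>
</body></html>"""

doc = parseString(xhtml)
rows = []
for tr in doc.getElementsByTagName("tr"):
    # Each <td> holds a single text node in this fragment.
    cells = [td.firstChild.data for td in tr.getElementsByTagName("td")]
    rows.append(cells)

print(rows)   # -> [['Item', 'Price'], ['Book', '12.50']]
```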

You might find my modules useful too. ClientCookie has an interface
just like urllib2 (and uses it to do its work), but handles cookies
and some other stuff too. ClientForm makes it easier to work with
HTML forms. ClientTable is currently a heap of junk, don't use it ;-)
I've just rewritten ClientForm on top of the DOM, which lets you
switch back and forth between the two APIs (and also lets you handle
JavaScript, rather badly ATM) -- coming RSN...

http://wwwsearch.sourceforge.net/
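For a flavour of what cookie handling looks like: ClientCookie's functionality was later absorbed into the standard library (cookielib in Python 2, http.cookiejar in Python 3), keeping the urllib2-style interface. A sketch that builds a cookie-aware opener without making any request:

```python
# A cookie-aware URL opener: every open() call through it would
# send stored cookies and record any Set-Cookie headers it receives.
import http.cookiejar
import urllib.request

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# opener.open("http://example.com/") would now round-trip cookies;
# no request is made in this sketch, so the jar starts out empty.
print(len(jar))   # -> 0
```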
The other, completely different, way of web scraping is to use the
"automation" capabilities of the various big web browsers: Microsoft
Internet Explorer, KDE's Konqueror and Mozilla are all scriptable from
Python. You need the Python for Windows extensions, PyKDE or PyXPCOM
respectively to control those browsers. Advantages: easy handling of
JavaScript and other assorted nonsense, and they're generally
reasonably well-tested and stable pieces of software (not to mention
de-facto standards). Disadvantages: poor portability in some cases,
and they're rather big, complicated, closed applications that are hard
to modify (compared to the pure Python approach) and to distribute
(which last, I guess, isn't a problem for you, since you'll be the
only one using your software). Other problems: COM (for MSIE) is a
bit of a headache for newbies, PyXPCOM last time I looked seemed a
pain to install (Brendan Eich mentioned in a newsgroup post that that
has been changing recently, though), and PyKDE might not be that well
tested (it's a very big wrapper!).

One other bunch of software worthy of mention: you can use Jython to
access various Java libraries. HTTPClient and httpunit look like they
might be useful. In particular, the latter has some JavaScript
support.
John
Jul 18 '05 #2
jj*@pobox.com (John J. Lee) writes:
ti****@hanmail.net (Jenny) writes:
I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the internet. I think this

[...]

Forgot to say: if you don't already know, Google Groups can be worth
its weight in round tuits. Try some searches there, in
comp.lang.python, on the stuff I mentioned.
John
Jul 18 '05 #3
| IIUYC, what you're contemplating is called "web scraping"
| ....

John ....

I did a bit of web scraping over the past week end
for a friend that is interested in Lotto numbers ....

The Lotto numbers were readily available on the web
and presented as well-formed and readable HTML tables ....

The primary problem I found up front was being able
to parse and transform this data into something
that Python, or any other language, might be able
to cope with for subsequent analysis ....

Since the number of records that I was dealing with
in this case was relatively small, only a couple of thousand,
I could manage the initial data transformations
using my genetically encoded EyeBall parser,
a text editor, and a couple of one-off Python scripts ....

The first step in each case for the source files
was using HTML Tidy to ...

"clean up the horrid HTML you'll find on the web"

I'd like to emphasize for the benefit of the original poster
that the initial data parsing will probably entail a fair amount
of non-trivial work and that the subsequent data analysis
and reporting will seem almost trivial by comparison ....
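Once the rows are out of the HTML, the analysis step really can be that short. A toy sketch with invented draw data, counting how often each lotto number turns up:

```python
# The "almost trivial" analysis step: tally number frequencies
# from rows as they might come out of the parsed HTML tables.
from collections import Counter

draws = [                      # invented draws, for illustration
    [3, 11, 24, 38, 41, 49],
    [3, 7, 24, 30, 41, 44],
    [11, 24, 29, 38, 44, 49],
]

counts = Counter(n for draw in draws for n in draw)
print(counts.most_common(3))   # -> [(24, 3), (3, 2), (11, 2)]
```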

Thanks for posting the info regarding different approaches,
as I think it will be useful for me when I get around
to replacing my EyeBall parser with something more effective ....

--
Cousin Stanley
Human Being
Phoenix, Arizona

Jul 18 '05 #4
In article <87************@pobox.com>, John J. Lee <jj*@pobox.com> wrote:
ti****@hanmail.net (Jenny) writes:
I am doing research about the relationship between sales rates and
discounted prices or recommendation frequency. To do this, I need to
access the databases of commercial web sites via the internet. I think
this is possible because it is similar to the work of price-comparison
sites and web robots.


IIUYC, what you're contemplating is called "web scraping" -- at least,
it is by Cameron Laird, and I like the name. Others might know it as
"web client programming". Cameron wrote an article about this a while
back (Unix Review?) which you might like if you're a newbie -- Google
for it (but note that the Perl book he mentions has actually been
replaced by a newer one by Sean Burke, also from O'Reilly).

I am studying Python these days because I think it is a good language
for the work.

[...]

I think so too.

Jul 18 '05 #5
cl****@lairds.com (Cameron Laird) writes:
In article <87************@pobox.com>, John J. Lee <jj*@pobox.com> wrote:
ti****@hanmail.net (Jenny) writes:
[...] I'm ... reserved about the prospects for the proposed research. The commercial
sites you want to study are, in my experience, some of the most difficult to
"scrape".
Which (ATM, anyway) is a good reason for doing it with browser automation.

Complementing that difficulty is the poverty of inference I anticipate
you'll be able to ground on what you find there; their commerce has a lot
more noise than signal, as I see it.
What do you mean 'their commerce has more noise than signal'?

'Twould be great, though, for you to
uncover something real. Good luck.


What I was wondering was where the sales data are going to come from.
John
Jul 18 '05 #6
In article <87************@pobox.com>, John J. Lee <jj*@pobox.com> wrote:
Jul 18 '05 #7
cl****@lairds.com (Cameron Laird) writes:
[...John wrote:]
What I was wondering was where the sales data are going to come from.

That's a typical part. As I understand Jenny, she's going
to look at, say, eBay, and correlate "sales" with "price"
and "marketing" variables.

Oh, eBay, I see. I was thinking about non-auction sites. On auction
sites, some of the sales data are public, I suppose.
John
Jul 18 '05 #8

