Bytes | Software Development & Data Engineering Community

sk

I am developing a program to crawl a site (it looks like craigslist).
Since it has more than 20,000 entries, I have to go to each category
page, parse it with regular expressions, and extract the data into a
database. This data will be updated every two days.

The problem I am facing now is that I have a number of clients' sites
running on the same machine. If my program pushes CPU usage above 80%,
the web server might hang and stop accepting outside connections until
I reboot my server.

I came up with an idea to reduce the processing overhead:
1. Go to the site and download all the pages without parsing them.
2. Once all the pages have been downloaded locally, start parsing.
3. Save all the data in a database.
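
The three steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested crawler: the regular expression, the `page_NNNNN.html` file naming, the `entries` table, and the use of SQLite are all assumptions made for the example, and the pause between downloads is an extra not mentioned in the steps.

```python
import re
import sqlite3
import time
import urllib.request
from pathlib import Path

# Hypothetical pattern: assumes entries look like <a class="entry" href="...">title</a>.
ENTRY_RE = re.compile(r'<a class="entry" href="([^"]+)">([^<]+)</a>')


def fetch(url):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")


def download_all(urls, out_dir, fetch_page=fetch, delay=1.0):
    """Step 1: save every page to disk without parsing it."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        (out / f"page_{i:05d}.html").write_text(fetch_page(url), encoding="utf-8")
        time.sleep(delay)  # pause between requests to keep the load low


def parse_all(out_dir):
    """Step 2: yield (href, title) pairs from the saved pages."""
    for path in sorted(Path(out_dir).glob("page_*.html")):
        yield from ENTRY_RE.findall(path.read_text(encoding="utf-8"))


def save_all(rows, db_path):
    """Step 3: store the extracted rows in a single transaction."""
    con = sqlite3.connect(db_path)
    with con:
        con.execute(
            "CREATE TABLE IF NOT EXISTS entries (href TEXT PRIMARY KEY, title TEXT)"
        )
        con.executemany("INSERT OR REPLACE INTO entries VALUES (?, ?)", rows)
    con.close()
```

Keeping the download and parse steps separate also means the parser can be re-run on the saved files without fetching the pages again.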

If anyone has a better idea, let me know.

SK

Aug 11 '06 #1
If you can get access to the database, you are better off replicating
it... but I guess that is not an option...

Jonathan
Aug 11 '06 #2
Unfortunately not.

Aug 11 '06 #3

Try using the RSS feeds.

Aug 11 '06 #4

Richard,

Thank you for sharing your idea.
> Try using the rss feeds.
I don't know much about RSS feeds, but the feed has to be set up on
the source site, right? Say I want to get some data from
www.mysite.com: mysite.com has to provide the RSS XML file, right?

What I want to do is crawl external sites and extract data.

SK

Aug 11 '06 #5

sm************@gmail.com wrote:
> The problem I am facing now is that I have a number of clients' sites
> running on the same machine. If my program pushes CPU usage above 80%,
> the web server might hang and stop accepting outside connections until
> I reboot my server.

Hard to tell from your post... is this script running through the web
server as a page?

If so, you might try converting it to a command-line script and running
it via cron. In that situation, ideally, the operating system should
schedule the processes so that the crawler does not lock up the web
server.
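
As a rough sketch of that suggestion, a stand-alone script can lower its own scheduling priority and pause between pages, so the web server gets the CPU first. The function names, the pause length, and the placeholder URL are assumptions for illustration; `os.nice` is only available on Unix-like systems.

```python
import os
import time

# Hypothetical crontab entry (roughly every two days at 03:00):
#   0 3 */2 * * /usr/bin/python3 /path/to/crawl.py


def lower_priority(increment=19):
    """Ask the OS to schedule this process behind others (Unix only)."""
    if hasattr(os, "nice"):
        return os.nice(increment)
    return None


def throttled(items, pause=0.5, sleep=time.sleep):
    """Yield items one at a time, sleeping after each to cap CPU use."""
    for item in items:
        yield item
        sleep(pause)


def main(urls):
    lower_priority()
    for url in throttled(urls):
        print("would crawl", url)  # placeholder for the real fetch-and-parse step


if __name__ == "__main__":
    main(["http://www.mysite.com/"])  # placeholder URL
```

The priority change only helps when the crawler runs as its own process; a script executed inside the web server shares that server's worker and cannot be deprioritized separately.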

Aug 11 '06 #6
sm************@gmail.com wrote:
>> Try using the rss feeds.
> I don't know much about RSS feeds, but the feed has to be set up on
> the source site, right? Say I want to get some data from
> www.mysite.com: mysite.com has to provide the RSS XML file, right?
Yes, the feed must come from the source.
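
If the source site does publish a feed, reading it is much cheaper than parsing HTML with regular expressions. Here is a minimal sketch, assuming a plain RSS 2.0 document; the sample feed and the `read_items` helper are made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed standing in for whatever the source site serves.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <item><title>Bike for sale</title><link>http://www.mysite.com/1</link></item>
    <item><title>Sofa, free</title><link>http://www.mysite.com/2</link></item>
  </channel>
</rss>"""


def read_items(feed_xml):
    """Return (title, link) for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


for title, link in read_items(SAMPLE_FEED):
    print(title, link)
```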

Carl
Aug 11 '06 #7
