Parsing a website - strategy

Hi,

I recently got a project to collect info from different websites and put
it into a DB. I was wondering what the best technique is to implement
something like that.

How should I open the pages from other websites: with fopen, through a
socket, or with curl?

After that, what is the fastest way to parse a whole page for info, and
of course to parse it several times to get different pieces of info from
the same page?

Regards

May 6 '06 #1
aka_eu wrote:
> Hi,
>
> I recently got a project to collect info from different websites and put
> it into a DB. I was wondering what the best technique is to implement
> something like that.
>
> How should I open the pages from other websites: with fopen, through a
> socket, or with curl?

Any of them works; it depends on which websites you are accessing and
what you need to do. If the answer to any of these questions is yes,
use curl:
Will your script need to auto-submit any forms to these websites? Do
any of the sites use cookies? If a page is inaccessible, do you need to
know why?

file_get_contents is the easiest way, but it tells you very little when
the webpage is inaccessible, and unless you pass it a stream context it
only performs simple GET requests.
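
A minimal sketch of the file_get_contents approach, to show how little you get back on failure. The helper name fetch_simple is made up for illustration, and fetching URLs this way assumes allow_url_fopen is enabled in php.ini.

```php
<?php
// Hypothetical helper (not part of PHP): fetch a page with file_get_contents().
// Requires allow_url_fopen=On for http:// URLs.
function fetch_simple(string $url): ?string
{
    // The @ operator silences the warning PHP raises on failure;
    // the only signal left is the boolean false return value.
    $body = @file_get_contents($url);

    // Normalise "false" to null so callers can test for failure explicitly.
    return $body === false ? null : $body;
}
```

Calling fetch_simple() with a page URL returns the body as a string, or null if anything went wrong, without saying what went wrong.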

Curl has comprehensive error reporting, you can post forms using
curl_setopt with CURLOPT_POST and CURLOPT_POSTFIELDS, it can deal with
cookie-based websites, it can pretend to be a browser/bot, and it has
plenty of other useful stuff.
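
A rough sketch of those curl features in one place, assuming the curl extension is loaded. The function name fetch_with_curl, the keys of the returned array, the timeouts, and the user-agent string are all placeholders chosen for illustration.

```php
<?php
// Hypothetical helper (not part of PHP): GET or POST a page with curl and
// report what happened. Pass $postFields to submit a form, and $cookieFile
// to keep cookies between requests.
function fetch_with_curl(string $url, ?array $postFields = null, ?string $cookieFile = null): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);       // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);       // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);          // seconds, arbitrary choice
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);                // seconds, arbitrary choice
    curl_setopt($ch, CURLOPT_USERAGENT, 'MyScraper/0.1'); // placeholder identity

    if ($postFields !== null) {
        // Submit a form: url-encoded body sent with the POST method.
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    }
    if ($cookieFile !== null) {
        // Read cookies from, and write them back to, the same file.
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    }

    $body = curl_exec($ch);   // false on failure

    return [
        'ok'     => $body !== false,
        'body'   => $body === false ? null : $body,
        'status' => curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
        'errno'  => curl_errno($ch),                       // 0 means no transport error
        'error'  => curl_error($ch),                       // human-readable reason
    ];
}
```

When a page is inaccessible, the errno, error and status entries say why, which is exactly what file_get_contents does not give you.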

You could do all this yourself using sockets, but it's very tedious and
curl has already done it for you.

> After that, what is the fastest way to parse a whole page for info, and
> of course to parse it several times to get different pieces of info from
> the same page?

Best use DOM.

I've seen some people use regular expressions to do it but the regexes
soon end up being a nightmare to maintain or change when the website
inevitably changes. But if you're only looking for a few pieces of
information from a few sites preg_match could work.
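
For the "few pieces of information" case, a preg_match sketch could look like this. The sample markup and the pattern are made up for illustration; a real page would need its own pattern, and that pattern is what breaks when the site changes.

```php
<?php
// Sample page content, invented for this example.
$html = '<html><head><title>Price list</title></head><body>...</body></html>';

// ~ is used as the delimiter so the slash in </title> needs no escaping.
// i = case-insensitive, s = let . match newlines inside the title.
if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m)) {
    $title = trim($m[1]);   // first capture group: the text between the tags
} else {
    $title = null;          // pattern did not match
}
```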

With DOM you parse the page into a DOM tree using
DOMDocument->loadHTML(), then use the DOM methods and XPath (the
DOMXPath class) to get what you want. Especially XPath....
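
Here is a minimal sketch of that workflow: the page is parsed once and then queried several times for different pieces of info. The sample markup, the ids and class names, and the queries are invented for the example.

```php
<?php
// Sample page content, invented for this example.
$html = <<<'HTML'
<html><head><title>Price list</title></head>
<body>
  <h1>Products</h1>
  <table id="prices">
    <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
    <tr><td class="name">Gadget</td><td class="price">24.50</td></tr>
  </table>
  <a href="/next">Next page</a>
</body></html>
HTML;

// Real-world pages are rarely valid HTML; collect parser warnings
// internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);      // parse once
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Query 1: the page title as a plain string.
$title = $xpath->evaluate('string(//title)');

// Query 2: one entry per table row, read relative to the row node.
$products = [];
foreach ($xpath->query('//table[@id="prices"]//tr') as $row) {
    $name  = $xpath->evaluate('string(td[@class="name"])', $row);
    $price = $xpath->evaluate('string(td[@class="price"])', $row);
    $products[$name] = (float) $price;
}

// Query 3: every link target on the page.
$links = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $links[] = $attr->nodeValue;
}
```

When the site layout changes, usually only the XPath strings need updating, which is the maintenance advantage over regexes.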

I don't know if it's the fastest at runtime, but if anyone knows a more
flexible, useful way of data mining, I'd like to know.

The DOM method getElementById doesn't reliably work unless the page has
a proper doctype (which rules out most webpages).
http://blog.bitflux.ch/wiki/GetElementById_Pitfalls explains the
problem and the solutions, and has a straightforward example of using
XPath as well.
http://www.zvon.org/xxl/XPathTutoria.../examples.html is a good
XPath tutorial. The site is ugly, but there are plenty of good examples
to learn from and an interactive lab.
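
One way to sidestep the getElementById issue is to select by the id attribute with XPath, which does not depend on the doctype. A small sketch, with made-up markup and id values:

```php
<?php
// Sample fragment, invented for this example; note there is no doctype.
$html = '<div><p id="intro">Hello</p><p id="outro">Bye</p></div>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match any element whose id attribute equals "intro".
$node = $xpath->query('//*[@id="intro"]')->item(0);

// item(0) is null when nothing matched.
$text = $node !== null ? $node->textContent : null;
```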

Seeya

Tim

May 6 '06 #2
