
Spider and get tag information of one web page

Hi all,
I would like to know if anyone knows of a code sample.
Let's say, for example:
http://shopping.yahoo.com/search;_yl...did=&x=51&y=10

As you can see, there are a lot of items.
I need to be able to get the image link, navigation URL, price,
description, etc. of each item and then store them in a database.

I know that there is a way of searching the HTML code and returning
values (but I don't know how).
Any help would be appreciated.
Thank you,

May 9 '07 #1
On May 9, 5:37 am, "discounton...@gmail.com" <discounton...@gmail.com> wrote:
> I know that there is a way of searching in the html code and return
> values (but don't know how)

Use Regular Expressions.
More info: http://www.google.com/search?hl=en&q...ssions+asp.net

In your case you should get the text and parse it using patterns.

Here's the complete pattern to get the link, name, description and
price:

(?<=\<h2\>\<a\shref=\")
(?<url>(.|\n)*?)(\"\>)(?<name>(.|\n)*?)(\<\/a\></h2\>\n\<br\/\>)
(?<description>(.|\n)*?)(\n)
(.|\n)*?
(\<span\sclass\=\"price\"\>)(?<price>.*?)(\<\/span\>)

Note: in the code, the whole pattern has to be on one line.

Here's an example of the code:

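// Needs: using System.Text.RegularExpressions;

// Page source to search (placeholder).
string t = "html_from_yahoo";
// Verbatim string (@"...") so the backslashes reach the regex engine;
// inside it, write each double quote of the pattern as "".
string e = @"(?<=\<h2\>............(\<\/span\>)";

Regex r = new Regex(e, RegexOptions.Compiled);
MatchCollection matches = r.Matches(t);

foreach (Match m in matches)
{
    Response.Write("name=" + m.Groups["name"].Value);
    Response.Write("description=" + m.Groups["description"].Value);
    Response.Write("url=" + m.Groups["url"].Value);
    Response.Write("price=" + m.Groups["price"].Value);
}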
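
For anyone who wants to try the pattern outside a page, here is a minimal sketch that runs it against a small hand-written fragment. The class name ItemScraper, the Extract method and the sample markup are made up for illustration; the real Yahoo markup may differ, so treat the fragment as an assumption. The pattern is written over several lines here only because RegexOptions.IgnorePatternWhitespace is set, which tells the engine to skip unescaped whitespace in the pattern; without that option it still has to be on one line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ItemScraper
{
    // The pattern from the post, split over several lines.
    // IgnorePatternWhitespace makes the engine skip the unescaped
    // line breaks and indentation inside the pattern text.
    private static readonly Regex ItemPattern = new Regex(@"
        (?<=\<h2\>\<a\shref=\"")
        (?<url>(.|\n)*?)(\""\>)(?<name>(.|\n)*?)(\<\/a\></h2\>\n\<br\/\>)
        (?<description>(.|\n)*?)(\n)
        (.|\n)*?
        (\<span\sclass\=\""price\""\>)(?<price>.*?)(\<\/span\>)",
        RegexOptions.Compiled | RegexOptions.IgnorePatternWhitespace);

    // Returns one "name|description|url|price" string per matched item.
    public static List<string> Extract(string html)
    {
        var items = new List<string>();
        foreach (Match m in ItemPattern.Matches(html))
        {
            items.Add(string.Join("|",
                m.Groups["name"].Value,
                m.Groups["description"].Value,
                m.Groups["url"].Value,
                m.Groups["price"].Value));
        }
        return items;
    }

    // Hypothetical sample markup (an assumption, not the real page).
    public const string SampleHtml =
        "<h2><a href=\"http://example.com/a\">Widget</a></h2>\n<br/>A small widget\n" +
        "<div><span class=\"price\">$9.99</span></div>\n" +
        "<h2><a href=\"http://example.com/b\">Gadget</a></h2>\n<br/>A large gadget\n" +
        "<div><span class=\"price\">$19.50</span></div>\n";
}
```

Calling ItemScraper.Extract(ItemScraper.SampleHtml) should give one entry per item. Note that the pattern expects a bare \n after the closing h2 tag, so a page with \r\n line endings would need the line endings normalised first.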

Hope it helps

May 9 '07 #2
On May 9, 3:24 am, Alexey Smirnov <alexey.smir...@gmail.com> wrote:
> Use Regular Expressions.
> [...]
> Hope it helps

I have the full string of the page.
I would like to know what the syntax is, for example, to find the
full string from <table class="item_table"
until the next one, and return it as a string.

May 13 '07 #3
On May 13, 10:08 pm, "discounton...@gmail.com" <discounton...@gmail.com> wrote:
> I have the full string of the page.
> I would like to know what the syntax is, for example, to find the
> full string from <table class="item_table"
> until the next one and return it as a string

I guess something similar to this:

(\<table\sclass\=\"item_table\")(.|\n)*?(?=\<table\sclass\=\"item_table\")
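
A rough sketch of how that could be used to cut the page into one string per table. The names TableSplitter and Split are hypothetical. One thing differs from the pattern above and is my own assumption: the lookahead also accepts the end of the input (\z), because otherwise the last table on the page has no following table to stop at and would not be returned.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TableSplitter
{
    // Pattern from the post, plus "|\z" in the lookahead (an addition,
    // not part of the original) so the final table is matched as well.
    private static readonly Regex BlockPattern = new Regex(
        @"(\<table\sclass\=\""item_table\"")(.|\n)*?(?=\<table\sclass\=\""item_table\""|\z)");

    // Returns each block, starting at <table class="item_table" and
    // running up to (not including) the next such table or end of input.
    public static List<string> Split(string html)
    {
        var blocks = new List<string>();
        foreach (Match m in BlockPattern.Matches(html))
        {
            blocks.Add(m.Value);
        }
        return blocks;
    }
}
```

Each returned string can then be fed to the item pattern separately. The last block also carries whatever follows the final table (closing tags and so on), which may need trimming.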

May 13 '07 #4
