Scraping Just Images in C#

I'm trying to create a page where users can enter a URL into a textbox, click search, and get a list of all the images on that exact page.

So far I've managed to scrape the entire destination page and, using a regular expression, more or less extract the images. The problem is that the image paths are sometimes relative, which means the images won't display.

Unless I'm missing something very obvious, I can't see how to fix this. Does anyone know of a solution? I'm also hoping to do the same with embedded tags, so I can list things like Flash videos, etc.

I do realise that a lot of this code needs tidying up, but if anyone has any suggestions I'd be very grateful.

For example, if I searched for www.google.com, I would want it to list the main Google image, but instead of getting the path like this:
http://www.google.co.uk/intl/en_uk/images/logo.gif

I get the path like this:
/intl/en_uk/images/logo.gif

    // Code-behind for the search page. Needs: using System; using System.IO; using System.Net;
    // using System.Text; using System.Text.RegularExpressions; using System.Web.UI.WebControls;
    public void doSearch(object sender, EventArgs e)
    {
        results_tbl.Rows.Clear();
        string reqURL = url_searchBox_txt.Text;

        // Default to http:// when the user types a bare host name.
        if (!reqURL.StartsWith("http://"))
        {
            reqURL = "http://" + reqURL;
        }

        // Download the page source as text, releasing the connection afterwards.
        string st;
        WebRequest req = WebRequest.Create(reqURL);
        using (WebResponse resp = req.GetResponse())
        using (Stream s = resp.GetResponseStream())
        using (StreamReader sr = new StreamReader(s, Encoding.ASCII))
        {
            st = sr.ReadToEnd();
        }

        // Capture the attribute text of every <img ...> tag.
        Regex r = new Regex(@"<img([^>]+)>", RegexOptions.IgnoreCase | RegexOptions.Compiled);
        Match m = r.Match(st);
        while (m.Success)
        {
            TableRow tr = new TableRow();
            TableCell tc1 = new TableCell();    // Item

            // The src value is emitted unchanged, so relative paths stay relative.
            tc1.Text = "<img " + m.Groups[1].Value + "/>";

            tr.Cells.Add(tc1);

            results_tbl.Rows.Add(tr);
            m = m.NextMatch();
        }
    }

Nov 12 '09 #1
Bassem
You have solved one case out of three, not one out of two!

Note that the src attribute of the img element holds a URL reference, which can take one of three forms:
1. Fully qualified URL (e.g. http://www.google.co.uk/intl/en_uk/images/logo.gif).
2. Absolute path (begins with a slash, e.g. /intl/en_uk/images/logo.gif).
3. Relative path (relative to the current page).

You have handled the first form; two remain.

Anyway, consider this method:
1. You have "url_searchBox_txt.Text". Whatever form that URL takes, it always contains the domain name (host name), so you can split that part out.
2. Extract the img's src attribute and inspect its value. If it begins with the scheme and domain name, it is type #1.
Else if it begins with a "/" slash, it is type #2.
Else it is type #3.
3. For type #1: use it as is.
For type #2: insert the scheme and domain name at the start of the value. That's it, very simple.
For type #3: now you have a problem. The path depends on the directory of the page it came from, so you would need to work through the website's directories, and I have no idea how to solve this.

Thanks,
Bassem
Nov 13 '09 #2
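The three cases in the reply above do not have to be told apart by hand. The .NET System.Uri class can resolve a src value against the URL of the page it came from, and that covers the page-relative case as well. The following is a minimal sketch, assuming the page URL already carries its http:// prefix. ImageUrlResolver and ToAbsolute are illustrative names and are not part of the original code.

```csharp
using System;

public static class ImageUrlResolver
{
    // pageUrl: absolute URL of the scraped page (assumed to include the scheme).
    // src:     raw value of an <img> src attribute, in any of the three forms.
    // Returns the absolute URL, or null when src cannot be resolved.
    // Throws UriFormatException if pageUrl itself is not a valid absolute URL.
    public static string ToAbsolute(string pageUrl, string src)
    {
        Uri baseUri = new Uri(pageUrl);
        Uri resolved;

        // A fully qualified src comes back unchanged, a src starting with "/"
        // is resolved against the host, and anything else is resolved against
        // the directory of the page.
        return Uri.TryCreate(baseUri, src, out resolved)
            ? resolved.AbsoluteUri
            : null;
    }
}
```

With the Google example from the question, ToAbsolute("http://www.google.co.uk/", "/intl/en_uk/images/logo.gif") should give back the full logo URL.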
The problem is simple. A web page can load an image or media file either from its own server or from a remote server. When the page loads an image from an external server, the image reference looks like:
<img src="http://www.domain.com/01.jpg">
But when the page loads an image from its own server, the image reference looks like:
<img src="/images/01.jpg">
So to fix the problem, just add the site's http URL at the beginning, like this: "http://www.google.com" + img_result
Sep 29 '10 #3
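Prefixing the host by hand only works for paths that begin with a slash, and it first requires pulling the src value out of each tag. The sketch below does both steps: a regular expression captures each quoted src value, and System.Uri turns it into an absolute URL. ImageSrcExtractor and AbsoluteImageUrls are hypothetical names, and the pattern is a simplification, not a full HTML parser: it ignores unquoted src values and would also match attributes such as data-src.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ImageSrcExtractor
{
    // Captures the value of a single- or double-quoted src attribute
    // inside an <img ...> tag.
    private static readonly Regex ImgSrc = new Regex(
        @"<img\b[^>]*?\bsrc\s*=\s*[""']([^""']*)[""']",
        RegexOptions.IgnoreCase);

    // pageUrl: absolute URL of the scraped page (assumed to include the scheme).
    // html:    page source as already downloaded.
    // Returns one absolute URL per image whose src could be resolved.
    public static List<string> AbsoluteImageUrls(string pageUrl, string html)
    {
        Uri baseUri = new Uri(pageUrl);
        List<string> urls = new List<string>();

        foreach (Match m in ImgSrc.Matches(html))
        {
            Uri resolved;
            if (Uri.TryCreate(baseUri, m.Groups[1].Value, out resolved))
            {
                urls.Add(resolved.AbsoluteUri);
            }
        }
        return urls;
    }
}
```

Each returned URL can then be written into the results table as a new img tag, so the image displays no matter how the original page referenced it.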
