
Extract all Img Src tags using Java Regular Expression

Hi,

I have a huge string containing HTML tags, some of which are <img src="URL"> tags. I need to extract the URLs from all the occurrences of these tags in the input string. This is what I am doing:

    Pattern p = null;
    Matcher m = null;
    String word0 = null;
    String word1 = null;

    p = Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
    m = p.matcher(txt);
    while (m.find())
    {
        word0 = m.group(1);
        System.out.println(word0.toString());
    }

The problem with this code is that it prints only the last URL. For example, if there are 5 <img src="URL"> tags, this code prints only the URL contained within the 5th <img src> tag. Please tell me how to solve this.

Thanking you in advance
Jan 23 '08 #1
BigDaddyLH
1,216 Expert 1GB
Usually when someone wants to extract tags from XML or HTML, it makes sense to parse the input using a proper XML/HTML parser. Have you considered that? For example, what about HTML comments -- they may contain what looks like an image tag...
Jan 23 '08 #2
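
For anyone wondering what the parser route might look like, here is a minimal sketch. It assumes the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), which is only one possible choice; the class name ImgSrcParser and the method extractImgSrc are made up for illustration. Because the parser reports comments through a separate callback, an img tag that sits inside an HTML comment is not collected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original thread.
public class ImgSrcParser {

    // Returns the src attribute of every <img> element the parser reports.
    public static List<String> extractImgSrc(String html) {
        List<String> urls = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            // <img> has no closing tag, so the parser reports it as a "simple" tag.
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.IMG) {
                    Object src = attrs.getAttribute(HTML.Attribute.SRC);
                    if (src != null) {
                        urls.add(src.toString());
                    }
                }
            }
        };

        try {
            // true = ignore any charset declaration; the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but parse() declares IOException.
            throw new UncheckedIOException(e);
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<html><body><!-- <img src=\"hidden.png\"> -->"
                + "<img src=\"a.png\"><p>text <img alt=\"x\" src=\"b.jpg\"></p>"
                + "</body></html>";
        for (String url : extractImgSrc(html)) {
            System.out.println(url);
        }
    }
}
```

The trade-off is more code and a parser that predates modern HTML, in exchange for not having to reason about comments, attribute order, or quoting by hand.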
Change while (m.find()) to if (m.find()) and you get the src of the first img tag.
Change the while to a for loop and you can get whichever one you want.
Jan 31 '13 #3
Anas Mosaad
185 128KB
Because regex quantifiers are greedy: they match as much as they can. How does that relate to your case? The culprit is the .* at the beginning of your expression. It swallows as much of the document as it can, everything up to the start of your last img tag, so the very first find() already lands on the last image and nothing is left for a second match. If you moved the .* to the end of the expression, it would match only the first one. If you want all the images, just drop it, giving something like this:
    p = Pattern.compile("<img[^>]*src=[\"']([^\"']*)",
                    Pattern.CASE_INSENSITIVE);

P.S.: I added support for ' as well as " as a valid delimiter of the src URL.

@adeel809, an if will match only once. He would never be able to get all the images.

@BigDaddyLH, this is a very simple case that doesn't need a full parser.
Feb 2 '13 #4
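
To make the difference concrete, here is a small self-contained sketch that runs both patterns over the same single-line input. The class name ImgSrcExtractor, the extract helper, and the sample HTML are assumptions made up for illustration; only the two pattern strings come from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original thread.
public class ImgSrcExtractor {

    // Pattern from the question: the leading ".*" is greedy and skips ahead to the last <img>.
    static final Pattern WITH_GREEDY_PREFIX = Pattern.compile(
            ".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);

    // Pattern without the prefix: each find() starts at the next <img>.
    // Accepts either " or ' around the URL.
    static final Pattern WITHOUT_PREFIX = Pattern.compile(
            "<img[^>]*src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);

    // Collects capture group 1 (the URL) of every match found in the input.
    static List<String> extract(Pattern pattern, String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"a.png\"><IMG alt='x' SRC=\"b.jpg\"> <img src=\"c.gif\"></p>";

        System.out.println(extract(WITH_GREEDY_PREFIX, html)); // prints [c.gif]
        System.out.println(extract(WITHOUT_PREFIX, html));     // prints [a.png, b.jpg, c.gif]
    }
}
```

Note that m.group(1) is what returns the URL here; m.group() would return the whole matched text starting at <img.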
ibilal
1
Just change
word0=m.group(1);
to
word0=m.group();
May 19 '17 #5
