Extract all Img Src tags using Java Regular Expression

1 New Member

I have a huge string containing HTML tags, some of them <img src="URL"> tags. I need to extract the URLs from all occurrences of these tags in the input string. This is what I am doing:

    Pattern p = null;
    Matcher m = null;
    String word0 = null;
    String word1 = null;
    p = Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);
    m = p.matcher(txt);
    while (m.find()) {
        word0 = m.group(1);
        System.out.println(word0);
    }

The problem with this code is that it prints only the last URL. For example, if there are 5 <img src="URL"> tags, this code prints only the URL contained within the 5th <img src> tag. Please tell me how to solve this.

Thanking you in advance
Jan 23 '08 #1
1,216 Recognized Expert Top Contributor
Usually when someone wants to extract tags from XML or HTML, it makes sense to parse the input using a proper XML/HTML parser. Have you considered that? For example, what about HTML comments -- they may contain what looks like an image tag...
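To make the comment concern concrete, here is a minimal sketch of what a plain regex does when the input contains a commented-out image tag. The class name, the sample HTML, and the file names are made up for illustration; the pattern is a simplified variant of the one in the question, without the leading .*.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; name and sample data are assumptions for illustration.
final class CommentPitfallDemo {
    // A plain regex for <img ... src="..."> that knows nothing about HTML comments.
    private static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);

    // Collects every src value the regex finds, in document order.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<p><img src=\"real.png\"></p>"
                + "<!-- <img src=\"commented-out.png\"> -->";
        // The regex reports both URLs, including the one inside the comment.
        System.out.println(extract(html)); // prints [real.png, commented-out.png]
    }
}
```

Whether that matters depends on the input; for markup you control it may be acceptable, while for arbitrary pages a parser is the safer route.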
Jan 23 '08 #2
1 New Member
Change while (m.find()) to if (m.find()) and you get the src of the first img tag; change while to for and you can get as many as you want.
Jan 31 '13 #3
Anas Mosaad
185 New Member
Because regex quantifiers are greedy: they take the biggest match they can. How is that related to your case? It's the .* at the beginning of your expression. It grabs the largest possible match, which is the whole document up to the start of your last img tag. If you moved it to the end of your expression, it would match only the first one. If you want to get all the images, just drop it, so you have something like this:
    p = Pattern.compile("<img[^>]*src=[\"']([^\"']*)",
                    Pattern.CASE_INSENSITIVE);
P.S.: I added support for ' as well as " as valid delimiters of the src URL.
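As a rough, self-contained sketch of the difference, the following compares the pattern from the question with one that drops the leading .* and accepts either quote character. The class name, the extract helper, and the sample HTML are assumptions for illustration. Note that without Pattern.DOTALL the . does not match line terminators, so on multi-line input the greedy pattern would report the last image on each line rather than in the whole document.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample data are assumptions for illustration.
final class ImgSrcDemo {
    // Pattern from the question: the leading .* greedily swallows earlier tags.
    static final Pattern GREEDY =
            Pattern.compile(".*<img[^>]*src=\"([^\"]*)", Pattern.CASE_INSENSITIVE);

    // Same idea without the leading .*, accepting ' or " around the URL.
    static final Pattern FIXED =
            Pattern.compile("<img[^>]*src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);

    // Runs find() in a loop and collects capture group 1 of every match.
    static List<String> extract(Pattern p, String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<img src=\"a.png\"><IMG alt='x' src='b.png'><img src=\"c.png\">";
        System.out.println(extract(GREEDY, html)); // prints [c.png]
        System.out.println(extract(FIXED, html));  // prints [a.png, b.png, c.png]
    }
}
```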

@adeel809, an if will match only once. He would never be able to get all the images.

@BigDaddyLH, this is a very simple case that doesn't require all that sophistication.
Feb 2 '13 #4
1 New Member
Just change
word0=m.group(1);
to
word0=m.group();
May 19 '17 #5
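For reference, a small sketch of how Matcher.group() and Matcher.group(1) differ: group() with no argument returns the whole matched text, while group(1) returns only what the first pair of parentheses captured. The class name, the firstMatch helper, and the sample tag are assumptions for illustration; the pattern is the one without the leading .*.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample data are assumptions for illustration.
final class GroupDemo {
    static final Pattern IMG_SRC =
            Pattern.compile("<img[^>]*src=[\"']([^\"']*)", Pattern.CASE_INSENSITIVE);

    // Returns { whole match, capture group 1 } for the first match, or an empty array.
    static String[] firstMatch(String html) {
        Matcher m = IMG_SRC.matcher(html);
        if (!m.find()) {
            return new String[0];
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = firstMatch("<img alt=\"logo\" src=\"logo.png\">");
        System.out.println(r[0]); // prints <img alt="logo" src="logo.png
        System.out.println(r[1]); // prints logo.png
    }
}
```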
