Regular Expressions to parse HTML

I need to parse an HTML document of the following format.

I want to extract all the HTML from the first <div class="data"> through
"Data updated dd/mm/yyyy", inclusive (the dd/mm/yyyy date will change).
What kind of regular expression can I use? Note that I want everything in
between, including all the tags within the div tags.
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>
May 2 '06 #1
Patrick wrote:
I need to parse an HTML document of the following format.

I want to extract all the HTML from the first <div class="data"> through
"Data updated dd/mm/yyyy", inclusive (the dd/mm/yyyy date will change).
What kind of regular expression can I use? Note that I want everything in
between, including all the tags within the div tags.


Treating the input HTML as one string (C# code):

Regex regex = new Regex(@"(<div class=""data"">.*(?=<img))",
RegexOptions.Singleline);
Sample input:
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>

Sample output:
1 =»<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
«=

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
May 3 '06 #2
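
The pattern in the reply above ends the match just before an <img tag, and because .* is greedy it runs to the last <img in the input. If the "Data updated" line is the more reliable end marker, a variant can anchor on it directly. The following is only a minimal sketch, not the approach from the reply: the class name DataBlockExtractor and the method Extract are hypothetical, and it assumes the date always appears as two digits, two digits, and four digits separated by slashes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original thread.
public static class DataBlockExtractor
{
    // Singleline lets '.' match newlines. The lazy '.*?' stops at the first
    // "Data updated dd/mm/yyyy" that follows the first <div class="data">.
    // Assumption: the date is always written as dd/mm/yyyy with digits.
    private static readonly Regex DataBlock = new Regex(
        @"<div class=""data"">.*?Data updated \d{2}/\d{2}/\d{4}",
        RegexOptions.Singleline);

    // Returns the matched HTML fragment, or null when no match is found.
    public static string Extract(string html)
    {
        Match match = DataBlock.Match(html);
        return match.Success ? match.Value : null;
    }
}
```

Calling DataBlockExtractor.Extract(html) on the whole page as one string returns the fragment including the date line. Like any regex over HTML, this depends on the markup staying exactly in this shape (same attribute quoting, same class name); an HTML parser is the sturdier choice if the pages vary.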
