Bytes | Software Development & Data Engineering Community

Regular Expressions to parse HTML

I need to parse an HTML document of the following format.

I want to extract all the HTML from and including the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that span of the HTML, including all the tags within the div tags.
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>
May 2 '06 #1
Patrick wrote:
I need to parse an HTML document of the following format.

I am interested to obtain all the HTML from and including the first <div
class="data"> up to and including Data updated dd/mm/yyyy (where dd/mm/yyyy
will change). what kind of regular expressions can I use? Note I want
everything in the core of the HTML including all the tags within the div tags.


Treating the input HTML as one string (C# code):

// Singleline lets '.' match newlines; the lazy .*? stops before the first <img.
Regex regex = new Regex(@"(<div class=""data"">.*?(?=<img))",
RegexOptions.Singleline);
Sample input:
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer">not interested</div>

Sample output:
1 =»<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
«=
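
The pattern above ends the match at an <img tag, which works only while an image follows the date line. If the page layout is less predictable, one alternative is to anchor the end of the match on the "Data updated" text itself. The following is only a sketch under assumptions: the class name DataSectionExtractor and the method Extract are made up for illustration, and the date is assumed to always be two digits, two digits, four digits separated by slashes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class DataSectionExtractor
{
    // Lazy match from the first data div through the first
    // "Data updated dd/mm/yyyy" marker. Singleline lets '.' span newlines.
    // Assumption: the date is always written as NN/NN/NNNN.
    private static readonly Regex DataSection = new Regex(
        @"<div class=""data"">.*?Data updated \d{2}/\d{2}/\d{4}",
        RegexOptions.Singleline);

    // Returns the matched HTML, or null when the div or the marker is missing.
    public static string Extract(string html)
    {
        Match match = DataSection.Match(html);
        return match.Success ? match.Value : null;
    }
}
```

Calling DataSectionExtractor.Extract(html) on the page source should return the text from the first data div through the date, without depending on what comes after it. Like any regex over HTML, it assumes the div is written exactly as class="data" with double quotes and no extra attributes.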

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
May 3 '06 #2
