Regular Expressions to parse HTML

I need to parse an HTML document of the following format.

I want to obtain all the HTML from and including the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that part of the HTML, including all the tags within the div tags.
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer"> not interested</div>
May 2 '06 #1
Patrick wrote:
I need to parse an HTML document of the following format.

I want to obtain all the HTML from and including the first <div
class="data"> up to and including "Data updated dd/mm/yyyy" (where dd/mm/yyyy
will change). What kind of regular expression can I use? Note that I want
everything in that part of the HTML, including all the tags within the div tags.


Treating the input HTML as one string (C# code):

// Singleline makes '.' match newlines, so the pattern can span several lines.
Regex regex = new Regex(@"(<div class=""data"">.*(?=<img))",
RegexOptions.Singleline);
Sample input:
<html>
<head>
<!-- Not interested in parsing data in the header-->
</head>
<body>
<div class="head">not interested in this</div>
<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
<img src="notInterested.jpg">
some other rubbish
<div class="footer"> not interested</div>

Sample output:
1 =»<div class="data">Interested in data from this first data div</div>
<div class="data">There can be <b>other tags</b> within these divs too!</div>
<a name="data3"></a>(There can be some other stuff in between the div tags)
Data updated dd/mm/yyyy
«=
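
The question asks for the match to end at the "Data updated dd/mm/yyyy" text, while the pattern above ends just before the <img tag. A minimal sketch of a variant that anchors on the date text instead might look like the following. It assumes the real pages carry a numeric date (two digits, two digits, four digits, separated by slashes) rather than the literal placeholder in the sample. The names DataExtractor and Extract are hypothetical, chosen only for illustration, and are not part of the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DataExtractor
{
    // The lazy quantifier .*? stops at the first "Data updated <date>" that
    // follows the first data div. Singleline lets '.' match newlines, so the
    // match can span several lines of the document.
    private static readonly Regex DataBlock = new Regex(
        @"<div class=""data"">.*?Data updated \d{2}/\d{2}/\d{4}",
        RegexOptions.Singleline);

    // Returns the matched HTML, or an empty string when nothing matches.
    public static string Extract(string html)
    {
        Match match = DataBlock.Match(html);
        return match.Success ? match.Value : string.Empty;
    }
}
```

Calling DataExtractor.Extract(html) with the whole page as one string should return everything from the first data div through the date. Because the pattern is lazy, it ends at the first date line rather than the last one, which may matter if the page contains more than one. If the attribute quoting or spacing varies between pages, the pattern would need to be loosened accordingly.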

--
Take care,
Ken
(to reply directly, remove the cool car. <sigh>)
May 3 '06 #2
