Regular Expression to Parse HTML

Does anyone have a regex pattern to parse HTML from a stream?

I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an array
like this

SPAN
CLASS
myclass
A bit of text

or

Just some text, without tags

The array bit should follow, but I don't profess to be a regex expert (or
any kind of expert for that matter). Can anyone help with a suitable
pattern?

TIA

Charles
Nov 21 '05 #1
Is this useful for you?

http://regexplib.com/REDetails.aspx?regexp_id=520

Galin Iliev
MCSD, MCAD.NET
Nov 21 '05 #2
"Charles Law" <bl***@nowhere.com> wrote:
Does anyone have a regex pattern to parse HTML from a stream?

I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an
array like this

SPAN
CLASS
myclass
A bit of text


Maybe it's easier to use the HTML Agility Pack:

.NET Html Agility Pack: How to use malformed HTML just like it was
well-formed XML...
<URL:http://blogs.msdn.com/smourier/archive/2003/06/04/8265.aspx>

Download:

<URL:http://www.codefluent.com/smourier/download/htmlagilitypack.zip>

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Nov 21 '05 #3
Hi Galin

Thanks for the link. It looks like it ought to work, but when I test it
against even a simple tag it returns no matches. I tried verifying the
expression with Expresso and it gives the following error.

Reference to undefined group number 5.

Even when I test it using the facility on the web site it fails. Any idea
how to correct it?

Charles
Nov 21 '05 #4
Hi Herfried

It's not my lucky day today for getting things to work. When I try to open
the AgilityPack solution I get two errors:

Unable to open project HtmlDomView
Unable to open project GetBinaryRemainder

When I try to run it, it comes up with 12 compile errors, one of which is a
cryptographic failure!! It seems that HtmlAgilityPack.snk is missing too.

Charles
Nov 21 '05 #5
There's an example of just that in my article on the new VBRUN site here:

http://msdn.microsoft.com/vbrun/vbfusion/5000classes/

The expression I used is:

("(?<=href\s*=\s*[""']).*?(?=[""'])")
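As a rough illustration of what this expression pulls out, here is a minimal sketch in Python rather than VB.NET. It rests on assumptions: Python's `re` module does not accept a variable-width lookbehind such as `(?<=href\s*=\s*["'])`, so an ordinary capturing group is swapped in, and the name `extract_hrefs` and the sample markup are invented for the example.

```python
import re

# Capture-group variant of the href pattern. The original .NET expression uses
# a variable-width lookbehind, which Python's re module does not accept.
HREF_PATTERN = re.compile(r"""href\s*=\s*["'](.*?)["']""", re.IGNORECASE)


def extract_hrefs(html):
    """Return every quoted href value found in the given text."""
    return HREF_PATTERN.findall(html)


if __name__ == "__main__":
    sample = "<a href='http://example.com/a'>A</a> <A HREF = \"b.html\">B</A>"
    print(extract_hrefs(sample))
```

Note that this only finds href values; it says nothing about the tag name or the text between the tags.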
--
Scott Swigart - MVP
http://blog.swigartconsulting.com
Nov 21 '05 #6
Hi Scott

It looks like this would specifically decode hrefs, and so if I wanted to
decode another tag I would need to change the expression. To decode many
different tags I would need to generate multiple expressions and test
against each; please correct me if I have misunderstood. What I am hoping
for is a generic expression that will decode all tags that conform to the
general html format. I realise that this would also decode tags that are not
valid html, but this would not matter as I have control over the file and
what is in it.

Charles
Nov 21 '05 #7
Charles,

Maybe I can point you to a library called MSHTML. It is not the nicest one,
but it is very good for filtering tags from a document, or for going through
the document tag by tag, with a loop something like this (pDocuments is a
collection of documents).

\\\
For Each iDocument As mshtml.IHTMLDocument2 In pDocuments
    For i As Integer = 0 To iDocument.all.length - 1
        Dim hrefname As String = Nothing
        ' Every item in the "all" collection is an HTML element
        Dim hElm As mshtml.IHTMLElement = _
            DirectCast(iDocument.all.item(i), mshtml.IHTMLElement)
        Dim tagname As String = hElm.tagName.ToLower()
        If tagname = "a" OrElse tagname = "chk" Then
            Dim anchor As mshtml.IHTMLAnchorElement = _
                DirectCast(hElm, mshtml.IHTMLAnchorElement)
            If Not anchor.href Is Nothing Then
                hrefname = anchor.href.ToString()
            End If
        End If
        ' etc etc
    Next
Next
///

In this newsgroup I mostly leave the answers about this to somebody who, by
coincidence, has the same name as you; he has been busy with it much longer
and more actively than I have.

Maybe you can search for his answers.

:-)))))

Cor
Nov 21 '05 #8
Now why didn't I think of that ;-) I shall look this fellow up, of whom you
speak, and see what he has to say on the matter.

I have now got the Agility Pack working. It is somewhat smaller than mshtml
and, I suspect, quicker.

It's actually quite good, and may well be better than the regex idea;
especially since I don't currently have a regex that works! I had thought
that, for a large file, regex would be quicker than mshtml, but I have no
actual evidence of that. Conversely, though, I think that the Agility Pack
will be every bit as quick as a regex, if not quicker. Anyway, it works,
which is the main thing.

Charles
Nov 21 '05 #9
Charles,

"Charles Law" <bl***@nowhere.com> wrote:
I have now got the Agility Pack working. It is somewhat smaller than
mshtml and, I suspect, quicker.


I am glad to hear that you finally got the Agility Pack to work :-).

--
M S Herfried K. Wagner
M V P <URL:http://dotnet.mvps.org/>
V B <URL:http://classicvb.org/petition/>

Nov 21 '05 #10
Charles Law wrote:
I have a well structured file, where each line is of the form

<sometag someattribute='attr'>text</sometag>

for example

<SPAN CLASS='myclass'>A bit of text</SPAN>, or
Just some text, without tags

What I would like to be able to do is parse each line so that I get an array
like this

SPAN
CLASS
myclass
A bit of text

or

Just some text, without tags


Assuming it's always attrib='value', and never attrib="value",

// ExplicitCapture | Multiline | IgnorePatternWhitespace

^
(
< (?<tag>\w+) \s+
(?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' \s* >
(?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$
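
For anyone who wants to try the pattern outside .NET, the following is a minimal Python sketch of the same idea, not a definitive port. The assumptions: Python writes named groups as `(?P<name>...)` and the backreference as `(?P=tag)`, `re.VERBOSE` stands in for IgnorePatternWhitespace, and `(?:...)` stands in for ExplicitCapture. The function name `parse_line` is made up for the example.

```python
import re

# Python rendering of the pattern above. Differences from the .NET original:
# named groups are written (?P<name>...), the backreference is (?P=tag), and
# (?:...) stands in for RegexOptions.ExplicitCapture.
LINE_PATTERN = re.compile(
    r"""
    ^
    (?:
        < (?P<tag>\w+) \s+
        (?P<attribute>\w+) \s* = \s* ' (?P<value>[^']*) ' \s* >
        (?P<text>.*) </ (?P=tag) >
    ) .*
    |
    (?P<bare_text> .+)
    $
    """,
    re.MULTILINE | re.VERBOSE,
)


def parse_line(line):
    """Split one line into [tag, attribute, value, text] or [bare_text]."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return []
    if match.group("tag") is not None:
        return [match.group(name) for name in ("tag", "attribute", "value", "text")]
    return [match.group("bare_text")]
```

A line with a tag but no attribute falls through to the bare-text branch, because this version requires exactly one single-quoted attribute.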

--

www.midnightbeach.com
Nov 21 '05 #11
Charles,
In addition to the other comments:

Rather than attempt to coerce Regex into parsing HTML, have you considered
using an HTML parser/reader such as the SgmlReader?

http://www.gotdotnet.com/Community/U...4-C3BD760564BC

Hope this helps
Jay
Nov 21 '05 #12
Hi Jon

As with my reply to an earlier response, it looks like the expression you
have given is specific to a given tag and attribute (unless I have
misunderstood the syntax), whereas I am looking for something to parse _any_
tag and attribute. Although the tags I am parsing are limited in number, it
would still be too onerous to create multiple expressions to compare with.

Thanks for the suggestion.

Charles
Nov 21 '05 #13
Hi Jay

I have just had a look at the link, and it is similar, I think, to the
Agility Pack. Now that I have the Agility Pack working I am going to try and
make that work for me, unless a regex comes up. I think the code to use a
regex would be shorter/simpler, but of course that does not necessarily
equate with speed, and that is my overriding concern (well, that and
reliability, of course).

Charles
Nov 21 '05 #14
Charles Law wrote:
As with my reply to an earlier response, it looks like the expression you
have given is specific to a given tag and attribute (unless I have
misunderstood the syntax), whereas I am looking for something to parse _any_
tag and attribute. Although the tags I am parsing are limited in number, it
would still be too onerous to create multiple expressions to compare with.


You misread. ?<attribute> &c captures to the named group "attribute" -
it doesn't match "attribute".
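
To make the distinction concrete, here is a small Python sketch (Python spells the named group `(?P<attribute>...)` where .NET uses `(?<attribute>...)`). The helper name `attribute_pairs` and the sample tags are assumptions made for illustration; the point is only that the group name labels the capture and places no restriction on what is matched.

```python
import re

# "attribute" here is only the name of the capture group; the group itself
# matches any run of word characters, whatever the attribute is called.
ATTRIBUTE_PATTERN = re.compile(r"(?P<attribute>\w+)\s*=\s*'(?P<value>[^']*)'")


def attribute_pairs(tag_text):
    """Return (name, value) pairs for every single-quoted attribute found."""
    return [
        (m.group("attribute"), m.group("value"))
        for m in ATTRIBUTE_PATTERN.finditer(tag_text)
    ]
```

The same pattern therefore picks up `CLASS`, `ID`, or any other attribute name without being changed.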

You should try it. I spent five minutes writing it for you for free.
Assuming it's always attrib='value', and never attrib="value",

// ExplicitCapture | Multiline | IgnorePatternWhitespace

^
(
< (?<tag>\w+) \s+
(?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' \s* >
(?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$


--

www.midnightbeach.com
Nov 21 '05 #15
Jon

I apologise if I appeared dismissive of your efforts. I have tried it with

<SPAN CLASS='result'>Hello world</SPAN>

and it collects elements perfectly. I tried it with

<SPAN>Hello world</SPAN>

and it collects everything in bare_text. Is there a way to make it still
collect in the designated fields?

Thanks again.

Charles
Nov 21 '05 #16
Charles Law wrote:

Jon

I apologise if I appeared dismissive of your efforts. I have tried it with

<SPAN CLASS='result'>Hello world</SPAN>

and it collects elements perfectly. I tried it with

<SPAN>Hello world</SPAN>

and it collects everything in bare_text. Is there a way to make it still
collect in the designated fields?


Of course. But you said everything would look like

<sometag someattribute='attr'>text</sometag>

or bare text. Try

#[ExplicitCapture|Multiline|IgnorePatternWhitespace]

^
(
<
(?<tag>\w+)
(\s+ (?<attribute>\w+) \s* = \s* ' (?<value>[^']*) ' )? \s*
>
(?<text>.*) </ \k<tag> >
) .*
|
(?<bare_text> .+)
$
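
A Python sketch of this optional-attribute variant follows, under the same assumptions as before: `(?P<name>...)` and `(?P=tag)` replace the .NET named-group syntax, `re.VERBOSE` replaces IgnorePatternWhitespace, and `(?:...)` replaces ExplicitCapture. It also assumes the opening tag is closed by a `>` before the text group. The name `parse_fields` is hypothetical.

```python
import re

# The attribute part is wrapped in an optional non-capturing group, so a tag
# with no attribute still fills the "tag" and "text" groups.
OPTIONAL_ATTR_PATTERN = re.compile(
    r"""
    ^
    (?:
        <
        (?P<tag>\w+)
        (?: \s+ (?P<attribute>\w+) \s* = \s* ' (?P<value>[^']*) ' )? \s*
        >
        (?P<text>.*) </ (?P=tag) >
    ) .*
    |
    (?P<bare_text> .+)
    $
    """,
    re.MULTILINE | re.VERBOSE,
)


def parse_fields(line):
    """Return a dict holding only the named groups that took part in the match."""
    match = OPTIONAL_ATTR_PATTERN.match(line)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}
```

Returning a dict rather than a list makes it easy to tell whether the attribute fields were present on a given line.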

--

www.midnightbeach.com
Nov 21 '05 #17
Jon

As we say in these parts, you know stuff.

Thanks muchly.

Charles
Nov 21 '05 #18
> I have a well structured file

If you can guarantee that the file will always be well-formed, you can use System.Xml namespace classes to do the parsing for you.
i.e. XmlReader / XmlWriter / XmlDocument or any of the XPath readers/writers/document.
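
As a sketch of that idea, the snippet below uses Python's standard `xml.etree.ElementTree` as a stand-in for the System.Xml classes; it is not the .NET API. It assumes each line is either one complete, well-formed element or plain text, and the function name `parse_xml_line` is invented for the example.

```python
import xml.etree.ElementTree as ET


def parse_xml_line(line):
    """Return [tag, attr, value, ..., text] for an element line, else [line]."""
    try:
        element = ET.fromstring(line)
    except ET.ParseError:
        # Not a well-formed element, so treat the whole line as bare text.
        return [line]
    fields = [element.tag]
    for name, value in element.attrib.items():
        fields.extend([name, value])
    fields.append(element.text or "")
    return fields
```

Unlike the regex, the parser accepts either quote style and any number of attributes, but it rejects anything that is not well-formed, including trailing text after the closing tag.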

--
Dave Sexton
dave@www..jwaonline..com
Nov 21 '05 #19
Hi Dave

Actually, you have hit on something there. I write the file in the first
place as HTML, but I could write it as XML, but use HTML tags. I would then
have the right class structure to read it back in. Marvellous. It pays to
look outside the box.

Thanks.

Charles
Nov 21 '05 #20
Charles,
NOTE: The SgmlReader I mentioned in my earlier post allows you to treat
any HTML as XML.

Hope this helps
Jay
Nov 21 '05 #21
Charles,
| but I could write it as XML, but use HTML tags.

That would be XHTML ;-)

If you are writing the files, then this may be the way to go.
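
A minimal sketch of the writing side, again using Python's `xml.etree.ElementTree` in place of the .NET XML writer classes. The helper names `write_line` and `read_line` are hypothetical. The point is that a serialiser escapes the text for you, so whatever is written can be read back by any XML parser.

```python
import xml.etree.ElementTree as ET


def write_line(tag, text, **attributes):
    """Serialise one element as a well-formed XML line, escaping as needed."""
    element = ET.Element(tag, attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


def read_line(line):
    """Parse a line produced by write_line back into (tag, attributes, text)."""
    element = ET.fromstring(line)
    return element.tag, dict(element.attrib), element.text


if __name__ == "__main__":
    line = write_line("span", "Fish & chips", **{"class": "myclass"})
    print(line)
    print(read_line(line))
```

Text containing `&` or `<` survives the round trip because it is escaped on the way out and unescaped on the way in.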

Hope this helps
Jay
Nov 21 '05 #22
Hi Jay

You won't be surprised to hear that this is a continuing theme.

Once upon a time, there was RTF, but it was slow, and the people wept, for
it was very, very slow, and they got very, very bored waiting.

So, the developer chappie considered the many possible alternatives, and
decided to simplify the whole thing by invoking the minor devil known as the
listview. But the users came back and said, "but we liked the rich text box,
because it had colours and stuff".

And the developer said, "you have colours, what are you complaining about;
the listview is every bit as colourful, and quicker to boot, it just doesn't
retain the colours when you save and reload".

And then he added, "you are lucky to have anything at all, so just be
grateful", but he went away thinking that he had somehow done the users a
disservice.

So, anyway, he came up with the idea of saving the output as html, so that
it could be opened by the great God Microsoft Word; oh, and some browser
thingy called IE.

But then there was the dilemma: how to load it back into the application
with colour, as the users had become used to. And it was then that Regular
Expression came to the developer one night in a dream. But he knew little of
the Regular Expression, so he sought help from the great developers in the
sky. And they said, try this ... no, try this ... and he tried it, and it
worked; sort of.

But by this time, the developer had grown weary, and also his calculating
machine had become defective because he had done some re-installing and it
had mucked up his debugger, and it took him a day-and-a-half to put it
right. So, by Sunday evening he was really very weary indeed, and then some.

Finally, a door opened, and a bright light shone in. The developer tried
some stuff, and it worked. He wrote a set of classes to serialise and
de-serialise an html class, which looked remarkably like real html, which is
apparently something called xhtml.

So, now we are back in the present. The story is nearly at its end. The
developer just needs some sleep (and the love of a good woman), and all will
be right with the world.

And so, to sleep, perchance to dream, ay there's the rub.

Charles
Nov 21 '05 #23
Charles,
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good woman), and all
| will be right with the world.

Can't really help you on either of those, other than wishing you luck in
those areas...

This question and the "Easiest way to generate XML in VB.NET" post remind me
of Item #29, "Always Use a Parser", in Elliotte Rusty Harold's book
"Effective XML - 50 Specific Ways to Improve Your XML" (Addison-Wesley),
which lists a number of other reasons to use a parser. Although Item #29 is
largely about reading, I find the topic apropos to writing also. Hence my
suggestion, without realizing the connection, of using either the SgmlReader
or XHTML...

Hope this helps
Jay


"Charles Law" <bl***@nowhere.com> wrote in message
news:en**************@TK2MSFTNGP12.phx.gbl...
| Hi Jay
|
| You won't be surprised to hear that this is a continuing theme.
|
| Once upon a time, there was RTF, but it was slow, and the people wept, for
| it was very, very slow, and they got very, very bored waiting.
|
| So, the developer chappie considered the many possible alternatives, and
| decided to simplify the whole thing by invoking the minor devil known as
| the listview. But the users came back and said, "but we liked the rich text
| box, because it had colours and stuff".
|
| And the developer said, "you have colours, what are you complaining about;
| the listview is every bit as colourful, and quicker to boot, it just
| doesn't retain the colours when you save and reload".
|
| And then he added, "you are lucky to have anything at all, so just be
| grateful", but he went away thinking that he had somehow done the users a
| disservice.
|
| So, anyway, he came up with the idea of saving the output as html, so that
| it could be opened by the great God Microsoft Word; oh, and some browser
| thingy called IE.
|
| But then there was the dilemma: how to load it back into the application
| with colour, as the users had become used to. And it was then that Regular
| Expression came to the developer one night in a dream. But he knew little
| of the Regular Expression, so he sought help from the great developers in
| the sky. And they said, try this ... no, try this ... and he tried it, and
| it worked; sort of.
|
| But by this time, the developer had grown weary, and also his calculating
| machine had become defective because he had done some re-installing and it
| had mucked up his debugger, and it took him a day-and-a-half to put it
| right. So, by Sunday evening he was really very weary indeed, and then
| some.
|
| Finally, a door opened, and a bright light shone in. The developer tried
| some stuff, and it worked. He wrote a set of classes to serialise and
| de-serialise an html class, which looked remarkably like real html, which
| is apparently something called xhtml.
|
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good women), and all
| will be right with the world.
|
| And so, to sleep, perchance to dream, ay there's the rub.
|
| Charles
Nov 21 '05 #24
I have just spotted a Freudian slip
| So, now we are back in the present. The story is nearly at its end. The
| developer just needs some sleep (and the love of a good wom*e*n), and all
Maybe there is something going on in my head that I don't know about ...
wouldn't be the first time.

I don't see any specific support for XHTML in .NET, unless it goes by
another name. I have my solution, using the XmlSerializer to serialise and
de-serialise a class hierarchy that resembles the html document I want to
manipulate. It requires that I name the classes quite carefully, and there
are some things that I cannot readily do, such as put comments
(<!-- -->) into a STYLE tag, but it works.
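
For anyone following along, the general shape of that approach might look something like the sketch below. It is only an illustration, not the actual code: it uses Python's standard xml.etree.ElementTree as a stand-in for the XmlSerializer, and the names Span, save and load are made up for the example. It assumes a document that is just a body holding a flat list of span elements, written with lowercase tag names as XHTML expects.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical class mirroring one <span class='...'>text</span> element."""
    css_class: str
    text: str

    def to_element(self) -> ET.Element:
        # Build the XML element; the library escapes special characters for us.
        element = ET.Element("span", {"class": self.css_class})
        element.text = self.text
        return element

    @classmethod
    def from_element(cls, element: ET.Element) -> "Span":
        # Rebuild the object from a parsed element.
        return cls(css_class=element.get("class", ""), text=element.text or "")


def save(spans) -> str:
    """Serialise a list of Span objects as well-formed markup."""
    body = ET.Element("body")
    for span in spans:
        body.append(span.to_element())
    return ET.tostring(body, encoding="unicode")


def load(markup: str) -> list:
    """De-serialise markup produced by save() back into Span objects."""
    return [Span.from_element(e) for e in ET.fromstring(markup).iter("span")]
```

Because the same class does the writing and the reading, whatever is saved can be loaded back without any pattern matching on the markup.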

Have I missed a trick with this XHTML?

Charles
Nov 21 '05 #25
Charles,
| I don't see any specific support for XHTML in .NET
There is no specific support per se.

XHTML is HTML tags in an XML document.

Ergo the XHTML support in .NET is the classes in the System.Xml namespace,
such as the XmlSerializer. XmlSerializer directly or indirectly uses a
System.Xml.XmlWriter to write XML output. In other words, it follows Item #29
and uses a "parser".
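
As a rough illustration of what "use a parser" buys you for the one-element-per-line format in the original question, here is a minimal sketch. It uses Python's standard xml.etree.ElementTree purely as a stand-in for the System.Xml classes, parse_line is a hypothetical helper name, and it assumes every tagged line is a single well-formed element.

```python
import xml.etree.ElementTree as ET


def parse_line(line: str) -> list:
    """Return [tag, attribute name, attribute value, ..., text] for a line
    holding one well-formed element, or [line] for a line without tags."""
    stripped = line.strip()
    if not stripped.startswith("<"):
        # Plain text, no markup to take apart.
        return [stripped]
    # Raises xml.etree.ElementTree.ParseError if the line is not well-formed.
    element = ET.fromstring(stripped)
    parts = [element.tag]
    for name, value in element.attrib.items():
        parts.extend([name, value])
    if element.text:
        parts.append(element.text)
    return parts
```

With the sample line from the question, parse_line("<SPAN CLASS='myclass'>A bit of text</SPAN>") gives the four items SPAN, CLASS, myclass and A bit of text, and a line without tags comes back as a single item. No regex is involved, and a malformed line fails loudly instead of being silently mis-split.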

Hope this helps
Jay
Nov 21 '05 #26
Thanks for clearing that up. I think I have probably done the best I can
with it, then.

Cheers

Charles
Nov 21 '05 #27
