Shelton/C# should be able to match my HTM_TXT.EXE .

Jeff_Relf

Hi Tom, You showed: <<
private const string PHONE_LIST =
"495.1000__424.1111___(206)564-5555_1.800.325.3333";

static void Main( string[] args ) {
foreach (string phoneNumber in Regex.Split (PHONE_LIST, "_+")) {
Console.WriteLine (phoneNumber); } }

Output:
495.1000
424.1111
(206)564-5555
1.800.325.3333 >>
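For comparison, the same split can be sketched in C++ with the standard `<regex>` library. This is only an illustration under assumptions: the function name `SplitOnUnderscores` is hypothetical and appears nowhere in the thread.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits List on runs of one or more underscores,
// mirroring Regex.Split( PHONE_LIST, "_+" ) from the C# snippet above.
std::vector<std::string> SplitOnUnderscores( const std::string & List ) {
  static const std::regex Sep( "_+" ) ;
  // Submatch index -1 asks the iterator for the text between matches.
  std::sregex_token_iterator It( List.begin(), List.end(), Sep, -1 ), End ;
  return std::vector<std::string>( It, End ) ;
}
```

Calling it on the PHONE_LIST string above should yield the same four numbers that the C# version printed.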

Thanks Tom, that's very interesting,
but not enough to switch me away from LoopTo(),
RegEx simply isn't as flexible.

#define LoopTo( StopCond ) \
while ( Ch && ( Ch = ( uchar ) * ++ P ) \
&& ! ( Ch2 = ( uchar ) P [ 1 ], StopCond ) )
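A minimal sketch of how LoopTo might be used, assuming the surrounding code declares `uchar`, `Ch`, `Ch2` and `P` the way the macro expects. Those declarations and the helper `OffsetOfTagEnd` are hypothetical and are not taken from HTM_TXT.CPP.

```cpp
#include <cassert>
#include <cstddef>

typedef unsigned char uchar ;

#define LoopTo( StopCond ) \
  while ( Ch && ( Ch = ( uchar ) * ++ P ) \
  && ! ( Ch2 = ( uchar ) P [ 1 ], StopCond ) )

// Hypothetical helper: returns the offset of the first '>' after Text[0],
// or the offset of the terminating NUL when there is no '>'.
std::size_t OffsetOfTagEnd( const char * Text ) {
  if ( ! * Text ) return 0 ;   // the macro pre-increments, so guard ""
  const char * P = Text ;
  int Ch = 1, Ch2 = 0 ;        // Ch must start non-zero for the loop to run
  LoopTo( Ch == '>' ) { }      // empty body: the condition advances P
  ( void ) Ch2 ;               // Ch2 is the look-ahead char, unused here
  return ( std::size_t ) ( P - Text ) ;
}
```

Because the macro advances P before testing, the scan starts at the second character, and P is left on the character that satisfied the stop condition.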

It's a very simple matter to convert HTML to plain text,
following these rules:

These are valid HTML tags: <! Comment --> <Alpha> </Alpha>
But, due to the leading space, < Alpha> is not.
Things like &Unknown are sent through untranslated, for obvious reasons.
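The rules above can be sketched roughly as follows. This is one assumed reading of them, not the code in HTM_TXT.CPP: `OpensTag` and `StripTags` are hypothetical names, and a tag is taken to end at the next `>`, so a comment that contains `>` would be cut short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Assumed rule: '<' opens a tag only when it is immediately followed by
// '!', a letter, or '/' plus a letter. "< Alpha>" therefore is not a tag.
bool OpensTag( const char * P ) {
  if ( P [ 0 ] != '<' ) return false ;
  unsigned char Next = ( unsigned char ) P [ 1 ] ;
  if ( Next == '!' ) return true ;
  if ( Next == '/' ) Next = ( unsigned char ) P [ 2 ] ;
  return std::isalpha( Next ) != 0 ;
}

// Copies In to the result, skipping everything from a tag's '<' to its '>'.
// Entities such as "&Unknown" are not touched at all.
std::string StripTags( const std::string & In ) {
  std::string Out ;
  std::size_t I = 0 ;
  while ( I < In.size() ) {
    if ( OpensTag( In.c_str() + I ) ) {
      std::size_t End = In.find( '>', I ) ;
      if ( End == std::string::npos ) break ;   // unterminated tag: drop the rest
      I = End + 1 ;
    } else Out += In [ I ++ ] ;
  }
  return Out ;
}
```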

Pass HTM_TXT.EXE a .HTML file and it spits out a .TXT file.
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.EXE
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.CPP
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.VCPROJ

If RegEx is as powerful as you say,
you should be able to produce something that works at least as well,
and which is just as readable, or more, to me.

Nov 17 '05

Richard Blewett [DevelopMentor]

Can you take this OT stuff off the .NET groups, please?

Regards

Richard Blewett - DevelopMentor
http://www.dotnetconsult.co.uk/weblog
http://www.dotnetconsult.co.uk

Nov 17 '05 #51

Stefan Simek

Jeff_Relf wrote:

> Hi Stefan_Simek, The title of the post you replied to was: <>
>
> Can you show me links like my AA.HTM, HTM_TXT.EXE and AA.TXT ?
>
> This is the input, View_Source --> File --> Save_Page_As
> http://www.Cotse.NET/users/jeffrelf/AA.HTM

http://www.triaxis.sk/temp/AA.HTM

> This is what HTM_TXT.EXE outputs:
> http://www.Cotse.NET/users/jeffrelf/AA.TXT

http://www.triaxis.sk/temp/AA.TXT

> http://www.Cotse.NET/users/jeffrelf/HTM_TXT.EXE
> http://www.Cotse.NET/users/jeffrelf/HTM_TXT.CPP
> http://www.Cotse.NET/users/jeffrelf/HTM_TXT.VCPROJ

http://www.triaxis.sk/temp/HTM_TXT.ZIP

Nov 17 '05 #53

Jeff_Relf

Re: http://www.triaxis.sk/temp/AA.TXT
Well, Howdy_Do_Dee, Stefan_Simek... well done !

But you're not preserving whitespace like I do,
you're simply collapsing whitespace,
so it doesn't handle the <pre> tag.

For example, HTM_TXT.EXE handles <pre> by turning this:
http://www.Cotse.NET/users/jeffrelf/index.htm
( View_Source --> File --> Save_Page_As )
into this:
http://www.Cotse.NET/users/jeffrelf/index.TXT

What I do is remove lines that have nothing but whitespace and tags
but leave all lines that had no tags, even if they're just blank lines.
I doubt that your RegEx could do that.
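The line rule just described can be sketched like this. It is a simplified illustration, not the code in HTM_TXT.CPP: the name `DropTagOnlyLines` is hypothetical, a tag is assumed to be `<` followed by a non-space character up to the next `>` on the same line, and tags that span several lines are not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// Drops every line that held only tags and whitespace; keeps every line
// that had no tags at all, even when that line is blank.
std::string DropTagOnlyLines( const std::string & Html ) {
  std::istringstream In( Html ) ;
  std::string Line, Out ;
  while ( std::getline( In, Line ) ) {
    std::string Text ;
    bool SawTag = false ;
    for ( std::size_t I = 0 ; I < Line.size() ; ) {
      std::size_t End = std::string::npos ;
      bool Opens = Line [ I ] == '<' && I + 1 < Line.size()
        && ! std::isspace( ( unsigned char ) Line [ I + 1 ] ) ;
      if ( Opens ) End = Line.find( '>', I ) ;
      if ( End != std::string::npos ) { SawTag = true ; I = End + 1 ; }
      else Text += Line [ I ++ ] ;   // ordinary text, or a '<' that opens no tag
    }
    bool Blank = Text.find_first_not_of( " \t\r" ) == std::string::npos ;
    if ( SawTag && Blank ) continue ;   // nothing but tags and whitespace
    Out += Text ;
    Out += '\n' ;
  }
  return Out ;
}
```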

On a much more minor point,
your Â » and Â © are each using two 7-bit char UTF encoding,
which can be further decoded to single 8-bit chars.

Given that Ch is the name of the first 7-bit char and Ch2 is the second,
UTF decoding goes something like this:
if ( ( Ch == 194 || Ch == 195 ) && ( Ch2 & 0xC0 ) == 0x80 )
return ( Ch & 3 ) << 6 | Ch2 & 0x3F ;
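Wrapped in a complete function, that decoding step might look like the sketch below. The name `Utf8ToLatin1` is hypothetical. It assumes the input is UTF-8 and only collapses the two-byte sequences that start with byte 194 or 195 (0xC2 or 0xC3), which cover U+0080 to U+00FF; every other byte is copied unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Collapses each two-byte UTF-8 sequence for U+0080..U+00FF into a single
// 8-bit char; all other bytes pass through untouched.
std::string Utf8ToLatin1( const std::string & In ) {
  std::string Out ;
  for ( std::size_t I = 0 ; I < In.size() ; ++ I ) {
    unsigned char Ch = ( unsigned char ) In [ I ] ;
    unsigned char Ch2 = I + 1 < In.size() ? ( unsigned char ) In [ I + 1 ] : 0 ;
    if ( ( Ch == 194 || Ch == 195 ) && ( Ch2 & 0xC0 ) == 0x80 ) {
      Out += ( char ) ( ( Ch & 3 ) << 6 | ( Ch2 & 0x3F ) ) ;
      ++ I ;   // the continuation byte has been consumed as well
    } else Out += ( char ) Ch ;
  }
  return Out ;
}
```

For instance, the byte pair 0xC2 0xBB becomes the single byte 0xBB, which is » in Latin-1.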

Re: http://www.triaxis.sk/temp/HTM_TXT.ZIP

That's a lot more files/directories than I showed you,
which, in my opinion, makes your project much less readable/maintainable.

Re: Double_clicking: C:\__\Stefan_Simek\bin\Release\htm_txt.EXE

I got a message saying that I didn't have the .NET's 2.0.5 framework.
I'm running on a Win_XP system I bought about 6 months ago at Office_Depot.
I have MS_Office_XP and Visual_Studio_NET_2003 installed on it,
with whatever the default install did.

Re: C:\__\Stefan_Simek\Program.cs

Although I'm sure it's not true, my copy of Visual_Studio_NET_2003 tells me
there's a bunch of syntax errors in it, including mismatched braces {}.

Re:
Regex.Replace( input
, @"(?'entity'&((\w+)|(#[0-9]+)|(#x[0-9a-fA-F]+));?)|
(?'tag'<[/?]?\w+(([^\][^""])|(\""[^\][^""]*?\""))*?>)
|(?'comment')",
delegate ( Match m ) { // convert entities
if ( m.Groups["entity"].Success ) {
if ( m.Value == "&nbsp;" ) return " ";
return System.Web.HttpUtility.HtmlDecode( m.Value ); }
if ( m.Groups[ "tag" ].Success && m.Value.ToLower() == " " )
return "\n";
// clear the rest
return ""; }, RegexOptions.Singleline );

Hmm... return System.Web.HttpUtility.HtmlDecode( m.Value ); ?
That's quite bizarre, nothing like C... but it works, I see.

Nov 17 '05 #54

Stefan Simek

Jeff_Relf wrote:

> Re: http://www.triaxis.sk/temp/AA.TXT
> Well, Howdy_Do_Dee, Stefan_Simek... well done !

Thx ;)

> But you're not preserving whitespace like I do,
> you're simply collapsing whitespace,
> so it doesn't handle the <pre> tag.

??

> For example, HTM_TXT.EXE handles <pre> by turning this:
> http://www.Cotse.NET/users/jeffrelf/index.htm
> ( View_Source --> File --> Save_Page_As )
> into this:
> http://www.Cotse.NET/users/jeffrelf/index.TXT
>
> What I do is remove lines that have nothing but whitespace and tags
> but leave all lines that had no tags, even if they're just blank lines.
> I doubt that your RegEx could do that.

Well, my conversion of your index.htm is byte-for-byte identical except
for two empty lines at the end of the document. And I see no special
treatment of the <pre> tag in your code, unless I'm blind...

> On a much more minor point,
> your Â » and Â © are each using two 7-bit char UTF encoding,
> which can be further decoded to single 8-bit chars.

The output encoding is UTF-8 by default in .NET. You can change it to
anything else by changing the line

using (StreamWriter sw = new StreamWriter(args[1]))

to

using (StreamWriter sw = new StreamWriter(args[1], false,
Encoding.GetEncoding(1250_or_whatever)))

> Given that Ch is the name of the first 7-bit char and Ch2 is the second,
> UTF decoding goes something like this:
> if ( ( Ch == 194 || Ch == 195 ) && ( Ch2 & 0xC0 ) == 0x80 )
> return ( Ch & 3 ) << 6 | Ch2 & 0x3F ;
>
> Re: http://www.triaxis.sk/temp/HTM_TXT.ZIP
>
> That's a lot more files/directories than I showed you,
> which, in my opinion, makes your project much less readable/maintainable.

???
I thought that VS6.0, for example, generated .dsw, .dsp, .ncb and more
files as well. The *only* file required is Program.cs. I've provided
a link to the .NET 1.1 source, build command and exe at the end.

> Re: Double_clicking: C:\__\Stefan_Simek\bin\Release\htm_txt.EXE
>
> I got a message saying that I didn't have the .NET's 2.0.5 framework.
> I'm running on a Win_XP system I bought about 6 months ago at Office_Depot.
> I have MS_Office_XP and Visual_Studio_NET_2003 installed on it,
> with whatever the default install did.

Well, as the error message says, it requires the .NET framework 2.0.5
(beta2).

> Re: C:\__\Stefan_Simek\Program.cs
>
> Although I'm sure it's not true, my copy of Visual_Studio_NET_2003 tells me
> there's a bunch of syntax errors in it, including mismatched braces {}.

That's because anonymous delegates were not supported back in 1.1.

> Re:
> Regex.Replace( input
> , @"(?'entity'&((\w+)|(#[0-9]+)|(#x[0-9a-fA-F]+));?)|
> (?'tag'<[/?]?\w+(([^\][^""])|(\""[^\][^""]*?\""))*?>)
> |(?'comment')",
> delegate ( Match m ) { // convert entities
> if ( m.Groups["entity"].Success ) {
> if ( m.Value == "&nbsp;" ) return " ";
> return System.Web.HttpUtility.HtmlDecode( m.Value ); }
> if ( m.Groups[ "tag" ].Success && m.Value.ToLower() == " " )
> return "\n";
> // clear the rest
> return ""; }, RegexOptions.Singleline );
>
> Hmm... return System.Web.HttpUtility.HtmlDecode( m.Value ); ?

I see no reason for writing my own entity parser as long as there's one
provided by the framework.

> That's quite bizarre, nothing like C... but it works, I see.

Sure it's not like C. It's been a few years since 1978, and the ways of
programming have evolved by now...

See the following for a .NET 1.1 version, with added command-line
checking and exception handling:
http://www.kascomp.sk/tmp/htm_txt.cs
http://www.kascomp.sk/tmp/htm_txt.exe
http://www.kascomp.sk/tmp/build.bat

Nov 17 '05 #55

Jeff_Relf

Hi Stefan_Simek,

Your HTM_TXT.EXE preserves whitespace reasonably well,
as your Index.TXT was exactly the same, byte for byte, as mine.
As I tried to say before, my code does not look for the <pre> tag,
it just always preserves whitespace.

Re: using (StreamWriter sw = new StreamWriter(args[1], false
, Encoding.GetEncoding(1250))) sw.Write(output);

Well done !

Re: This .NET 1.1 stuff:
http://www.kascomp.sk/tmp/htm_txt.cs
http://www.kascomp.sk/tmp/htm_txt.exe
http://www.kascomp.sk/tmp/build.bat

Now that's much more like it, well done,
I liked that much more than the .ZIP you showed before.

Re: Your .NET framework 2.0.5 ( beta2 ) code with anonymous delegates,

That was very interesting, but it complicated the installation.

Re: System.Web.HttpUtility.HtmlDecode( m.Value );,

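You wrote: << I see no reason for writing my own entity parser
as long as there's one provided by the framework. >>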

You have a point there, but I like playing with the lower_level stuff.

You wrote: << Sure it's not like C. It's been a few years since 1978,
and the ways of programming have evolved by now... >>

C# is a mutation, and a rather recent one at that,
it doesn't meet my needs... no #define.

Nov 17 '05 #56

Exciting Possibility

Jeff_Relf wrote:

> Hi Stefan_Simek,
>
> Your HTM_TXT.EXE preserves whitespace reasonably well,

Am I missing something here?

As a test I saved this page:

http://www.howstuffworks.com/search.php

as a text file.

I compiled htm_txt.cs in mono and ran it against search.php.html. It
produced a text file, search.php.txt, with only one line in it: what
appears to be the title.

Both the html source and txt output are attached.

HowStuffWorks - Search

Nov 17 '05 #57

Jeff_Relf

Hi John, Re: Your attempt to convert
http://www.howstuffworks.com/search.php

You can't just do a Save_Page_As, you Must do a View_Source first.

This is what the .HTM file should look like:
http://www.Cotse.NET/users/jeffrelf/BB.HTM
and my HTM_TXT.EXE translates it to this:
http://www.Cotse.NET/users/jeffrelf/BB.TXT

Notice that my HTM_TXT is faithful to the raw whitespace and the tag,
leaving/printing as many as was called for,
while Stefan_Simek's HTM_TXT won't allow more than two blank lines in a row.

But I had to radically modify my HTM_TXT to properly handle multilined tags:
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.EXE
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.CPP
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.VCPROJ

And, because HTM_TXT is merely a demo of code from X.CPP
( my custom e-mail client and newsreader ) these files were also affected:
http://www.Cotse.NET/users/jeffrelf/X.EXE
http://www.Cotse.NET/users/jeffrelf/X.CPP
http://www.Cotse.NET/users/jeffrelf/X.VCPROJ

Nov 17 '05 #58

Jeff_Relf

By the way John, Re:
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.CPP
http://www.kascomp.sk/tmp/htm_txt.cs

Both HTM_TXT.CPP and htm_txt.cs could be called obfuscated.
In fact, I consider my own code to be more readable,
( perhaps because I wrote it ).

Any code that does useful work is going to take time to understand.

Nov 17 '05 #59

Journey To The Center of The Earth

Jeff_Relf wrote:

> Any code that does useful work is going to take time to understand.

Very true...which is why you should not dismiss all .NET code as being
written by "script kiddies".

Nov 17 '05 #60

Journey To The Center of The Earth

Jeff_Relf wrote:

> But I had to radically modify my HTM_TXT to properly handle multilined tags:
>
> And, because HTM_TXT is merely a demo of code from X.CPP

I see, so you admit that all you do is write sample code...but
completely unextensible.

The c# folks just proved how inflexible c++ code is because you have to
set up a static starting point and then produce a result.

Whereas us c# people have been able to dynamically change our code very
quickly as you move the goal posts around to suit yourself.

Nov 17 '05 #61

Tom Shelton

In article <2s********************@speakeasy.net>, Journey To The Center of The Earth wrote:

> Jeff_Relf wrote:
>> But I had to radically modify my HTM_TXT to properly handle multilined tags:
>>
>> And, because HTM_TXT is merely a demo of code from X.CPP
>
> I see, so you admit that all you do is write sample code...but
> completely unextensible.
>
> The c# folks just proved how inflexible c++ code is because you have to
> set up a static starting point and then produce a result.
>
> Whereas us c# people have been able to dynamically change our code very
> quickly as you move the goal posts around to suit yourself.

That's why I gave up... Relf will never be satisfied. My attempt met
every one of the criteria that he laid out in the original post - except
for entity translation, and the only reason I didn't handle that was
because I was unclear on what he wanted to have happen (and I did ask
for clarification, but never did see a response). But, when he saw that
I did it in like 3-4 lines of code - suddenly, several more criteria are
added.

It was the same thing with the phone number parsing... With every post
of an answer he had to change the format.

Relf doesn't want to use C#. That's fine, we all make choices and he
has made his. I have no intention of producing another line of code for
him.

--
Tom Shelton

Nov 17 '05 #62

Jeff_Relf

Hi Tom_Shelton ( and Bellow ),
Re: The Piss_Poor job you did of converting HTML to plain text,

You thought you could match my HTM_TXT.CPP while your dishes dried,
....how naive !

Not even Stefan_Simek could faithfully preserve blank lines
or honor all tags.

Try converting BB.HTM Tom:
http://www.Cotse.NET/users/jeffrelf/BB.HTM
Make it look like this:
http://www.Cotse.NET/users/jeffrelf/BB.TXT

http://www.Cotse.NET/users/jeffrelf/HTM_TXT.EXE
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.CPP
http://www.Cotse.NET/users/jeffrelf/HTM_TXT.VCPROJ

The best Simek could do was
to leave a lot of blank lines where the tags used to be
and then consolidate multiple blank lines into one.

i.e. Simek's htm_txt.cs produces blank lines where I do Not want them,
and omits blank lines where I Do want them.

Does it surprise you that I didn't give you my full specs at first ?
It shouldn't... I didn't want to overwhelm you any more than I already was.

You told Bailo: << That's why I gave up... Relf will never be satisfied.
My attempt met every one of the criteria that he laid out
in the original post - except for entity translation... >>

Unlike Kelsey and Simek, you never figured out how to download an HTML page
( you must do a View_Source before the Save_Page_As ).

You wrote: << It was the same thing with the phone number parsing,
...With every post of an answer he had to change the format. >>

Right... you couldn't understand what my code was doing,
and, therefore, what I required of it.
Nor did you have the patience to have me explain it.
....No surprises there... huh ?

You concluded: << Relf doesn't want to use C#.
That's fine, we all make choices and he has made his.
I have no intention of producing another line of code for him. >>

Using COM and other bloatware is fine in a pinch ( ¡ ¡ ¡ Slo-o-ow ),
but it's not the hallmark of a serious coder.
....You're a Half_Assed coder, Shelton... end of story.

#define is too dangerous for C# kiddies such as you.

#define LOOP while ( 1 )

#define Loop( N ) int J = - 1, LLL = N ; while ( ++ J < LLL )

#define LoopTo( StopCond ) \
while ( Ch && ( Ch = ( uchar ) * ++ P ) \
&& ! ( Ch2 = ( uchar ) P [ 1 ], StopCond ) )

#define LoopXx( Xx ) Xx##P P = 0, B ; int J = -1 ; \
Xx##A BB = Xx.BB, EE = Xx.PP + 1, PP = BB - 1 ; \
if ( BB ) while ( ++ J, B = P = * ++ PP, PP < EE )

Nov 17 '05 #63

Jeff_Relf

Hi Bellow, You told me: <>

HTM_TXT.EXE demonstrates 47 lines of code from X.CPP,
and X is used daily by me, as it's the best e-mail client and newsreader
I've ever known... by miles.

And you know it's totally flexible, as you see how I change it all the time.

Nov 17 '05 #64
