Counting string tokens precisely in an HTML document

Hi there,

I have an HTML file like this.
----------------------------------------
<body>
<h1>Home Page</h1>
<p>
Welcome<br>To<br>

<br> My Home Page
</p>
--------------------------------------------

I want to know the exact number of string tokens in it.

The count should ignore the new lines (but simply deleting them would merge
words that are separated only by a new line), ignore runs of extra white
space, etc.

Any help would be appreciated.

I wrote this function initially, but it only works when the words are separated by a single space.

public int count_body(string s)
{
    char[] sp = {' '};
    int count = s.Split(sp).Length;
    return count;
}

Thanks!!
Nov 17 '05 #1
kman,

You are better off using an HTML parser for something like this. You
can use MSHTML (Microsoft's HTML parser) through interop and then read
the innerText property to get just the text of the document. That text is
easy to split into tokens: the words should be separated correctly even where
only BR tags sat between them, because the tags do not appear in innerText.
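
A minimal sketch of that last step, assuming the innerText value has already been read from MSHTML into a string (the interop call itself is not shown here). The names TokenCounter and CountTokens are made up for illustration, and the sample text is only a guess at what innerText might return for the page above. Passing a null separator array to string.Split makes it split on any white space, and StringSplitOptions.RemoveEmptyEntries (available from .NET 2.0) drops the empty entries that runs of spaces or blank lines would otherwise produce:

```csharp
using System;

public static class TokenCounter
{
    // Counts white-space-separated tokens in plain text,
    // e.g. the string obtained from innerText.
    public static int CountTokens(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return 0;
        }

        // A null separator array means "split on any white-space character";
        // RemoveEmptyEntries discards the empty strings between adjacent separators.
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return tokens.Length;
    }

    public static void Main()
    {
        // Hypothetical text, roughly what innerText might yield for the sample page.
        string text = "Home Page\r\n\r\nWelcome\r\nTo\r\n\r\n My Home Page";
        Console.WriteLine(CountTokens(text)); // prints 7
    }
}
```

Unlike s.Split(sp).Length with a single ' ' separator, this does not count the empty strings between consecutive spaces, and it treats new lines and tabs as separators too.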

Hope this helps.
--
- Nicholas Paldino [.NET/C# MVP]
- mv*@spam.guard.caspershouse.com

Nov 17 '05 #2
Try using regular expressions.
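
One way that suggestion might look, as a rough sketch only: replace every tag with a space so that words separated only by a tag (such as Welcome<br>To) do not merge, then count the runs of non-white-space characters. The names RegexTokenCounter and CountTokens are made up for illustration. The tag pattern is deliberately naive and assumes simple markup like the sample above; it does not handle comments, script or style blocks, attribute values containing '>', or entities such as &nbsp;.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTokenCounter
{
    // Counts word tokens in simple HTML by stripping tags with a regex.
    public static int CountTokens(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return 0;
        }

        // Replace each tag with a space rather than deleting it,
        // so "Welcome<br>To" becomes "Welcome To" and not "WelcomeTo".
        string text = Regex.Replace(html, "<[^>]+>", " ");

        // Every maximal run of non-white-space characters is one token;
        // new lines and repeated spaces are skipped automatically.
        return Regex.Matches(text, @"\S+").Count;
    }

    public static void Main()
    {
        string html =
            "<body>\n<h1>Home Page</h1>\n<p>\nWelcome<br>To<br>\n\n<br> My Home Page\n</p>";
        Console.WriteLine(CountTokens(html)); // prints 7
    }
}
```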

Nov 17 '05 #3
