Request for comments on HTML tag removal function

Hi All,

Just writing a quick function to remove HTML tags from a string (array of
chars) and I'd like your comments on my code - anything you'd do differently
or any mistakes etc. I'm still kinda new to C, so I'm not 100% confident
using pointers yet.

Anyway, the algorithm works like this: the loop steps over the string
character by character with two pointers, 's' (read) and 'c' (write). A toggle
variable indicates whether 's' is currently within an HTML tag. If it is, 's'
is incremented but 'c' isn't. If 's' isn't inside an HTML tag, the value
pointed to by 'c' is set to the value pointed to by 's' and both are
incremented. So basically, the string is being rebuilt in place, skipping over
HTML tags and their content.

Here's the code, all comments welcome:

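void striphtml (char *s) {
    char *c, t = 0;             /* c: write position; t: 1 while inside a tag */
    c = s;
    while (*s != '\0') {
        if (*s == '<') {
            t = 1;              /* entering a tag */
        } else if (*s == '>') {
            t = 0;              /* leaving a tag */
        } else if (!t) {
            *(c++) = *s;        /* outside any tag: keep the character */
        }
        s++;
    }
    *c = '\0';                  /* terminate the shortened string */
}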
Cheers.
~Kieran Simkin
Digital Crocus
http://digital-crocus.com/
Nov 14 '05 #1
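
A usage sketch for the function in post #1. Because the string is rebuilt in place, the argument has to be a writable buffer; passing a string literal directly would be undefined behaviour. The `demo` helper and the sample input are made up for illustration and are not part of the original post.

```c
#include <stdio.h>

/* Same algorithm as post #1: copy every character outside <...> in place. */
void striphtml(char *s)
{
    char *c = s;        /* write position, never ahead of s */
    int in_tag = 0;     /* 1 while s is between '<' and '>' */

    for (; *s != '\0'; s++) {
        if (*s == '<')
            in_tag = 1;
        else if (*s == '>')
            in_tag = 0;
        else if (!in_tag)
            *c++ = *s;
    }
    *c = '\0';
}

/* Hypothetical helper, for illustration only. */
void demo(void)
{
    /* An array initialised from a literal is writable; the literal itself is not. */
    char buf[] = "<p>Hello, <b>world</b>!</p>";

    striphtml(buf);
    puts(buf);
}
```

Since the output can only be shorter than the input, no second buffer is needed.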
"Chris McDonald" <ch***@budgie.csse.uwa.edu.au> wrote in message
news:20*************************@budgie.csse.uwa.edu.au...
> In comp.lang.c you write:
> > Just writing a quick function to remove HTML tags from a string (array of
> > chars) and I'd like your comments on my code - anything you'd do
> > differently or any mistakes etc. I'm still kinda new to C, so I'm not 100%
> > confident using pointers yet.
>
> Have to worry about the nasty ones, too.
> What if an HTML tag appears in an HTML comment?


That's a very good point, I've now made the following modification to my
code:

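void striphtml (char *s) {
    char *c, t = 0;             /* c: write position; t: tag nesting depth */
    c = s;
    while (*s != '\0') {
        if (*s == '<') {
            t++;                /* one level deeper */
        } else if (*s == '>') {
            t--;                /* one level back out */
        } else if (t < 1) {
            *(c++) = *s;        /* not inside any tag: keep the character */
        }
        s++;
    }
    *c = '\0';                  /* terminate the shortened string */
}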

Now instead of toggling 't' on and off, 't' becomes a counter for the depth of
nested HTML tags, and the string is only copied if the depth is less than
one, i.e., we're not inside any HTML tag.

Any other comments on this code?
~Kieran
Nov 14 '05 #2
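
One thing to watch with the counter in post #2: a '>' that appears outside any tag (as in "a > b") decrements 't' away from zero, and the next '<' only brings it back to zero, so the contents of the following tag leak into the output. A minimal sketch of one possible guard, assuming a stray '>' should simply be kept as text; the name `striphtml_depth` is made up here.

```c
/* Variant of the counter version: the depth never drops below zero,
   and a '>' that closes nothing is copied as ordinary text. */
void striphtml_depth(char *s)
{
    char *out = s;      /* write position */
    int depth = 0;      /* number of unclosed '<' seen so far */

    for (; *s != '\0'; s++) {
        if (*s == '<') {
            depth++;
        } else if (*s == '>' && depth > 0) {
            depth--;
        } else if (depth == 0) {
            *out++ = *s;    /* includes a stray '>' at depth 0 */
        }
    }
    *out = '\0';
}
```

With this guard, "a > b <i>c</i>" comes out as "a > b c". Whether tags should be counted as nesting at all is a separate question.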
"Kieran Simkin" <ki****@digital-crocus.com> wrote in message news:<RN*************@newsfe1-gui.ntli.net>...
> > Have to worry about the nasty ones, too.
> > What if an HTML tag appears in an HTML comment?
>
> [SNIP] Now instead of toggling 't' on and off, t becomes a counter for the depth of
> nested HTML tags and the string is only copied if the depth is less than
> one, ie, we're inside of less than one HTML tag.


Does that work? e.g.

<!-- Comment <!-- --> <a>

The <a> isn't in a comment. HTML comments (like C /* */ comments) don't nest.

Right?
Nov 14 '05 #3
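
To illustrate the point in post #3, here is a rough sketch that treats a comment as one unit ending at the first "-->", rather than counting nested '<' characters. It uses only `strncmp`, `strstr` and `strchr` from `<string.h>`; the name `striphtml_comments` and the choice to drop everything after an unterminated tag or comment are assumptions made for this example.

```c
#include <string.h>

/* Sketch: "<!--" ... "-->" ends at the FIRST "-->" (comments do not nest);
   anything else between '<' and the next '>' is treated as a tag. */
void striphtml_comments(char *s)
{
    char *out = s;      /* write position, never ahead of s */

    while (*s != '\0') {
        if (strncmp(s, "<!--", 4) == 0) {
            char *end = strstr(s + 4, "-->");
            if (end == NULL)
                break;          /* unterminated comment: drop the rest */
            s = end + 3;
        } else if (*s == '<') {
            char *end = strchr(s, '>');
            if (end == NULL)
                break;          /* unterminated tag: drop the rest */
            s = end + 1;
        } else {
            *out++ = *s++;
        }
    }
    *out = '\0';
}
```

On the example above, the comment ends at the first "-->", so the <a> that follows is handled as an ordinary tag.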

On Thu, 19 Aug 2004, Kieran Simkin wrote:

> "Chris McDonald" <ch***@budgie.csse.uwa.edu.au> wrote...
> > In comp.lang.c you write:
> > > Just writing a quick function to remove HTML tags from a string (array of
> > > chars) and I'd like your comments on my code - anything you'd do
> > > differently or any mistakes etc. I'm still kinda new to C, so I'm not 100%
> > > confident using pointers yet.
> > Have to worry about the nasty ones, too.
> > What if an HTML tag appears in an HTML comment?
>
> That's a very good point, I've now made the following modification to my
> code:
>
> [...] Now instead of toggling 't' on and off, t becomes a counter for the depth of
> nested HTML tags and the string is only copied if the depth is less than
> one, ie, we're inside of less than one HTML tag.


(According to another poster, HTML comments don't nest.)

What about quoted text?

<img src="rewind.png" alt="<<">

What about <pre> tags?

<pre>if (i>0) break; if (i<0) continue;</pre>

And then once you get your code to parse standard HTML, it's still a
good idea to do something semi-sensible with non-standard HTML.

<table border=1 bgcolor=FFFFFF
<tr><td>Hello, world!
</table>

:) But that's a question for comp.programming or comp.text.html
(if such a group exists; I don't think it does. Oh, well). In fact,
even your original question was kind of OT here---since you weren't
asking "does this code meet the spec," but rather "what kind of spec
should I make up for this code?" ;)

HTH,
-Arthur
Nov 14 '05 #4
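
For the quoted-text case in post #4, one possible approach is a small state machine that remembers which quote character is open while inside a tag, so that '<' and '>' inside an attribute value do not end the tag. This is only a sketch: the name `striphtml_quotes` is invented here, and it makes no attempt to handle raw '<' and '>' inside <pre> or to repair non-standard markup.

```c
/* Sketch: inside a tag, track the open quote character so that
   '<' and '>' within a quoted attribute value are ignored. */
void striphtml_quotes(char *s)
{
    char *out = s;          /* write position */
    int in_tag = 0;         /* 1 between '<' and the closing '>' */
    char quote = '\0';      /* '"' or '\'' while inside a quoted value */

    for (; *s != '\0'; s++) {
        if (!in_tag) {
            if (*s == '<')
                in_tag = 1;
            else
                *out++ = *s;
        } else if (quote != '\0') {
            if (*s == quote)
                quote = '\0';   /* closing quote found */
        } else if (*s == '"' || *s == '\'') {
            quote = *s;         /* opening quote inside a tag */
        } else if (*s == '>') {
            in_tag = 0;
        }
    }
    *out = '\0';
}
```

With the <img> example above, the "<<" inside alt="..." stays part of the tag and is removed along with it.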
On Thu, 19 Aug 2004 07:20:13 GMT, "Kieran Simkin"
<ki****@digital-crocus.com> wrote:
[...]

You don't handle '<' and '>' characters that aren't intended to open or
close a tag (as in "x < 7" or "y > 1").

Sure, theoretically there should be a "&lt;" instead of '<', but if you
parse «real world» HTML pages with that function, it will strip important
parts of the text on most pages.

Take another approach: check for '<', then jump into a function that
checks whether it opens a tag. If the '<' doesn't open a tag, that function
can return 0. If it does open a tag, the function may return the position
of the tag end '>' relative to the position of the '<' (call it the string
size of the tag).

Here is another problem you will have to consider. With
<a href="foo.html" onclick='if(x > 5) return false'>
you will end up with the ---» 5) return false'> «--- left over in the
stripped text.
Nov 14 '05 #5
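
The approach suggested in post #5 could look roughly like the sketch below. The names `tag_length` and `striphtml_checked` are made up, and the test for "does this '<' open a tag" (next character is a letter, '/' or '!') is an assumed heuristic, not a rule taken from the thread; an expression such as "i<n" would still be misread as a tag.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: p points at a '<'. Returns 0 if it does not look
   like a tag, otherwise the tag's length including '<' and '>'. */
size_t tag_length(const char *p)
{
    const char *q = p + 1;
    char quote = '\0';      /* open quote character, if any */

    /* Assumed heuristic: a tag starts with a letter, '/' or '!'. */
    if (!isalpha((unsigned char)*q) && *q != '/' && *q != '!')
        return 0;

    for (; *q != '\0'; q++) {
        if (quote != '\0') {
            if (*q == quote)
                quote = '\0';
        } else if (*q == '"' || *q == '\'') {
            quote = *q;
        } else if (*q == '>') {
            return (size_t)(q - p) + 1;
        }
    }
    return 0;               /* no closing '>': treat the '<' as text */
}

/* Strip in place, skipping only what tag_length() accepts as a tag. */
void striphtml_checked(char *s)
{
    char *out = s;

    while (*s != '\0') {
        size_t n = (*s == '<') ? tag_length(s) : 0;

        if (n > 0)
            s += n;         /* jump over the whole tag */
        else
            *out++ = *s++;  /* ordinary text, including a lone '<' or '>' */
    }
    *out = '\0';
}
```

Under these assumptions, "x < 7" and "y > 1" survive as text, and the onclick example is removed as a single tag because the '>' inside the quoted value is skipped.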
