By using this site, you agree to our updated Privacy Policy and our Terms of Use. Manage your Cookies Settings.
448,952 Members | 1,119 Online
Bytes IT Community
+ Ask a Question
Need help? Post your question and get tips & solutions from a community of 448,952 IT Pros & Developers. It's quick & easy.

Hows my code

P: n/a
The following code is the heart of a program that I wrote to extract
html tags from a webpage. How efficient is my code ?. Is there still
possible way to optimize the code. Am I using everything as per the
text book. I am just apprehensive whether this may break or may cause a
memmory leak. Any chance for it.

#define TOKN_SIZE 256

void tagfinder() {

char ch, *tokens;
int i, j ,len;
i=j=0;
//scan buffer holds the webpage as a string.
len = strlen(scan_buffer);

while(i < len) {
ch = scan_buffer[i++];
if(ch == '<') {
tokens = malloc(TOKN_SIZE*sizeof(char));
j=0;
while(ch != '>') {
ch = scan_buffer[i++];
if(j >= TOKN_SIZE)
tokens = realloc(tokens, (j+TOKN_SIZE) * sizeof(char));
if(ch != '>') {
tokens[j++] = ch;
tokens[j] = '\0';
}

}// end of while(ch != '>')
printf("%s\n",tokens);
free(tokens);
}//end of if(ch == '<')
}//end of while(len > 0)

}

Nov 14 '05 #1
Share this Question
Share on Google+
10 Replies


P: n/a
saraca means ashoka tree wrote:
The following code is the heart of a program that I wrote to extract
html tags from a webpage. How efficient is my code ?. Is there still
possible way to optimize the code. Am I using everything as per the
text book. I am just apprehensive whether this may break or may cause a
memmory leak. Any chance for it.
[code snipped; see up-thread]


Before worrying about efficiency, worry about correctness.
You use malloc() and realloc() without checking for failure,
the way you use realloc() will cause a memory leak if realloc()
ever fails, the insertion of '\0' can run off the end of your
allocated region, an empty tag "<>" will leave you with a
non-string lacking the terminal '\0', and a '<' without a
matching '>' will send your code completely off the rails.

Once you've fixed these five bugs (and any others I didn't
happen to spot in your badly-indented code), you can start
measuring the performance of your program to see whether any
efficiency improvements are needed. Keep in mind that if it
takes you one hour to improve the speed by one millisecond,
you must run the program 3.6 million times just to break even.

If efficiency improvements are needed (as they well may be;
your code as it stands is far from tight), here are four
suggestions. Note that the C language itself has no notion of
"efficiency," so the actual effect of these suggestions will
vary from platform to platform. As a practical matter, all
four are likely to improve matters, but this is not guaranteed.
Again, you must measure.

Suggestion #1: Learn how to use the strchr() function,
because it can probably locate the '<' and '>' characters
faster than you can. Don't reinvent the wheel.

Suggestion #2: If all you need to do is print out the
substrings between '<' and '>', print them directly from the
source buffer and get rid of the malloc() and realloc() calls.
Learn how to use the "%.*s" format specification, or learn how
to use fwrite().

Suggestion #3: If your real program needs to store the
substrings somewhere instead of just printing them out, don't
allocate memory until you've located the closing '>' and know
how much space you'll need. This avoids wasting memory when
you get a short substring, and avoids the overhead of realloc()
when you get a long one.

Suggestion #4: Learn how to use the memcpy() function,
because it can probably copy characters from the big string
to your destination area faster than you can. (It will
almost certainly do better than your current practice of
storing most destination positions twice!) Don't reinvent
the wheel.

--
Er*********@sun.com

Nov 14 '05 #2

P: n/a

On Tue, 21 Dec 2004, saraca means ashoka tree wrote:

The following code is the heart of a program that I wrote to extract
html tags from a webpage. How efficient is my code ?. Is there still
possible way to optimize the code[?]
Of course.
Am I using everything as per the text book[?] I am just apprehensive
whether this may break or may cause a memory leak. Any chance for it[?]


Re-post your code with some indentation, and maybe someone will take
the trouble to look at it. Right now, it's completely unreadable.
(If your problem is that you're trying to post hard tabs to Usenet...
don't! I recommend
http://www.contrib.andrew.cmu.edu/~a...ftware/detab.c
, for obvious reasons. ;) Run 'detab -R -4 myprogram.c' and re-post.)

Make sure the text you're posting actually compiles, by the way.
What you posted in this message doesn't compile in either C90 or C99,
the languages discussed in this newsgroup. I strongly recommend you
don't use '//'-style comments in C code you intend to show anyone;
they tend to do Bad Things like

// this is a long comment that overflows the line and turns into a
syntax error

and every so often (though less and less frequently, thankfully) we see

file://this comment was mangled by a Windows-based news client

-Arthur
Nov 14 '05 #3

P: n/a
saraca means ashoka tree wrote:
The following code is the heart of a program that I wrote to extract
html tags from a webpage. How efficient is my code ?.
Better would be to not use malloc at all, and just remember
a pointer to the token start, and its length (or perhaps,
the offset into scan_buffer of the start and end).

The only reason to malloc would be if you needed to
destroy scan_buffer but keep the tokens, or if you
needed to pass the token to a function that cannot
handle a length-counted string (printf is not one of
those functions).

#define TOKN_SIZE 256

void tagfinder() {

char ch, *tokens;
int i, j ,len;
i=j=0;
//scan buffer holds the webpage as a string.
len = strlen(scan_buffer);
You forgot to include stdlib.h and string.h
while(i < len) {
ch = scan_buffer[i++];
if(ch == '<') {
tokens = malloc(TOKN_SIZE*sizeof(char));
sizeof(char) is always 1.
You need to check malloc's return value. It will return NULL
if you have run out of memory.
j=0;
while(ch != '>') {
ch = scan_buffer[i++];
if(j >= TOKN_SIZE)
tokens = realloc(tokens, (j+TOKN_SIZE) * sizeof(char));
You need to check realloc's return value.
Also, if you run out of memory then realloc will return NULL
and leak the original buffer. So to avoid leaks in the case
of a memory shortage you need to so something like:
temp = realloc(.......);
if (!temp) { free(tokens); exit(EXIT_FAILURE); }
tokens = temp;

This is also bad design because you realloc one char
at a time. So if your token was 1024 in length then
you will end up doing 1024 - TOKN_SIZE allocations.
You could at least increase the allocation size by
TOKN_SIZE each time.

Even better would be to count the length of the token
before you do any allocations at all. Then you only
need to allocate once (a memory allocation is orders
of magnitude slower than scanning a token twice).
if(ch != '>') {
tokens[j++] = ch;
tokens[j] = '\0';
}
You overflowed 'tokens'. For example, if j == TOKN_SIZE-1
then tokens[j++]=ch sets the last character to ch,
and then tokens[j]=0 writes past the end of the buffer.

This is inefficient anyway because you write a \0 every
time. You should only write the \0 once, after the
token is finished.

}// end of while(ch != '>')
printf("%s\n",tokens);
free(tokens);
}//end of if(ch == '<')
}//end of while(len > 0)

}


Nov 14 '05 #4

P: n/a
I am not certain whether the program only exists to extract and store
tags, or if this is just a single piece of a web-browser or such.
If it is part of a larger application, I would recommend not to store
the strings. If you set up constants:

#define NO_TAG -1
#define TAG_HTML 0
#define TAG_HEAD 1
#define TAG_BODY 2
#define TAG_P 3

or such, then you can store your information in a tree structure that
you can enumerate and switch each tag:

int currentTag;
while ((currentTag = getNext(tagTree)) != NO_TAG) {
/* tagTree might be some sort of tree structure that retains the
hierarchy of the tags */

switch (currentTag) {
case TAG_HTML:
....code...
break;
case TAG_HEAD:
....code...

etc. ...
}
}

which will be far easier to deal with than strings and faster when it
comes to actually doing something based on what kind of tag it is.
Otherwise you will need to be performing string compares everywhere
through your code.

-Chris Williams

Nov 14 '05 #5

P: n/a
Try using Crystal REVS for C (www.sgvsarc.com) to take care of
indentation and formatting. Its flowcharts will help you review your
code.

Nov 14 '05 #6

P: n/a
"Jason Taylor" <ja***********@yahoo.com> writes:
Try using Crystal REVS for C (www.sgvsarc.com) to take care of
indentation and formatting. Its flowcharts will help you review your
code.


If you don't mind paying $399 for a single license (slightly
discounted for multiple licenses) -- and it seems to be for Windows
only.

Jason, you wouldn't happen to work for the company that sells this
thing, would you?

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #7

P: n/a
Keith, Why make a personal remark?

I just happen to be a very satisfied user of Crystal REVS for C. I
think it's a great product!

Allow me to ask a few simple questions:

Have you tried using it? You can download a free eval copy.

Its flowcharts and automatic formatting have saved me a lot of time.

Would you rather do all the low-level editing manually? Would you
rather have code that is hard to read?

Would you rather spend more time to understand a function instead of
using flowcharts?

How much do you value your time? How does $399 compare?

Did you take a look at Crystal FLOW for C? It costs $219 - single
quantity.

A Windows only product seems fine to me. There must be at least a few
hundred thousand individuals who design with C/C++ primarily in
Windows.

Nov 14 '05 #8

P: n/a
"Jason Taylor" <ja***********@yahoo.com> writes:
Keith, Why make a personal remark?
Because you came into this newsgroup (where you haven't posted
previously as far as I can tell) and posted what appear to be
advertisements for a commercial product. I've seen a lot of
advertisements posing as testimonials from satisfied customers.
I just happen to be a very satisfied user of Crystal REVS for C. I
think it's a great product!
Ok, I'll take your word for it. Even so, the topicality of your posts
is questionable, but I'm not going to make a big deal out of it.
Allow me to ask a few simple questions:

Have you tried using it? You can download a free eval copy.
No. Since I don't develop C software under Windows, I don't have much
use for it.
Its flowcharts and automatic formatting have saved me a lot of time.
I don't like flowcharts, but de gustibus et cetera.
Would you rather do all the low-level editing manually? Would you
rather have code that is hard to read?


Yes and no, respectively.

[...]

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Nov 14 '05 #9

P: n/a
On 24 Dec 2004 12:01:01 -0800, in comp.lang.c , "Jason Taylor"
<ja***********@yahoo.com> wrote:
A Windows only product seems fine to me.
But offtopic here - this is a platfom neutral group.
There must be at least a few
hundred thousand individuals who design with C/C++ primarily in
Windows.


So what? Theres a billion speak various dialects of indian, but thats not
topical here either.
--
Mark McIntyre
CLC FAQ <http://www.eskimo.com/~scs/C-faq/top.html>
CLC readme: <http://www.ungerhu.com/jxh/clc.welcome.txt>
Nov 14 '05 #10

P: n/a
saraca means ashoka tree wrote:
The following code is the heart of a program that I wrote to extract
html tags from a webpage. How efficient is my code ?. Is there still
possible way to optimize the code.
...
len = strlen(scan_buffer);

while(i < len) {
ch = scan_buffer[i++];


while (ch = scan_buffer[i++])
{
Nov 14 '05 #11

This discussion thread is closed

Replies have been disabled for this discussion.