can I know how to write a html parser in C

Hi

I am fairly familiar with C, but not an expert.

I want to know how I can write an HTML parser in C that only looks for
the image files referenced in an HTML file and displays or prints
all the images found in it.

How should I go about it?

Should I open the file with a file pointer, store the HTML file in an array
first, and then look for the img src,
e.g. with some string comparison?

Is there a sample on the net (nothing fancy, just a simple one) that I can
look at to give me an idea of what I need to do?

Thanks again

Nov 14 '05
Walter Roberson wrote:
State 2: recognize and discard whitespace (including newline).
When you get the first non-whitespace character, then if you had
no whitespace or if tolower(character) is not 'h' then transit to state 4
else transit to state 5

<snip>

State 5: you have recognized up to "<img h". recognize and accept
characters that match "ref=\"" and then enter url acceptance mode;
if you hit something else, go to state 4

<snip>

Just a slight nitpick on a seemingly good text (I have no idea about the subject
myself, so I can't really say anything about the quality of the text :)).
I was under the impression that image URLs were stored in the src attribute,
not the href one. :) Easy to switch anyway.
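
For illustration, the state-machine approach quoted above (using src rather than href) could be sketched roughly as follows. This is not Walter Roberson's code: the state names, the MAX_URL limit and the function scan_img_src are invented for this sketch, it only handles double-quoted src values, and it knows nothing about comments or script content.

```c
#include <ctype.h>
#include <stddef.h>

#define MAX_URL 256   /* assumed limit on a stored URL, including '\0' */

enum state {
    ST_TEXT,        /* outside any tag */
    ST_TAG_NAME,    /* after '<', matching the letters "img" */
    ST_AFTER_NAME,  /* matched "img"; whitespace must follow */
    ST_SKIP_TAG,    /* some other tag; wait for '>' */
    ST_ATTR_GAP,    /* inside <img ...>, between attributes */
    ST_ATTR_NAME,   /* matching src=" character by character */
    ST_OTHER_ATTR,  /* an attribute we do not care about */
    ST_QUOTED,      /* inside the quoted value of such an attribute */
    ST_URL          /* collecting the src value up to the closing quote */
};

/* Hypothetical helper: scans html and stores up to max_urls src values.
   Returns the number of URLs stored. */
size_t scan_img_src(const char *html, char urls[][MAX_URL], size_t max_urls)
{
    static const char tag[] = "img";
    static const char attr[] = "src=\"";
    enum state st = ST_TEXT;
    size_t k = 0, len = 0, count = 0;
    const char *p = html;

    while (*p != '\0' && count < max_urls) {
        unsigned char c = (unsigned char)*p;

        switch (st) {
        case ST_TEXT:
            if (c == '<') { st = ST_TAG_NAME; k = 0; }
            break;
        case ST_TAG_NAME:
            if (tolower(c) == tag[k]) {
                if (tag[++k] == '\0') st = ST_AFTER_NAME;
            } else { st = ST_SKIP_TAG; continue; }  /* re-examine c */
            break;
        case ST_AFTER_NAME:
            if (isspace(c)) st = ST_ATTR_GAP;
            else { st = ST_SKIP_TAG; continue; }
            break;
        case ST_SKIP_TAG:
            if (c == '>') st = ST_TEXT;
            break;
        case ST_ATTR_GAP:
            if (isspace(c)) break;
            if (c == '>') { st = ST_TEXT; break; }
            st = ST_ATTR_NAME; k = 0;
            continue;                               /* re-examine c */
        case ST_ATTR_NAME:
            if (tolower(c) == attr[k]) {
                if (attr[++k] == '\0') { st = ST_URL; len = 0; }
            } else { st = ST_OTHER_ATTR; continue; }
            break;
        case ST_OTHER_ATTR:
            if (c == '"') st = ST_QUOTED;
            else if (isspace(c)) st = ST_ATTR_GAP;
            else if (c == '>') st = ST_TEXT;
            break;
        case ST_QUOTED:
            if (c == '"') st = ST_OTHER_ATTR;
            break;
        case ST_URL:
            if (c == '"') {
                urls[count][len] = '\0';
                count++;
                st = ST_OTHER_ATTR;
            } else if (len + 1 < MAX_URL) {
                urls[count][len++] = (char)c;       /* overlong URLs are truncated */
            }
            break;
        }
        p++;
    }
    return count;
}
```

Because the scanner works one character at a time, a newline between "<img" and "src" is just more whitespace, and a '>' inside another attribute's quoted value does not end the tag.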

Nov 14 '05 #11
On 23 Feb 2005 13:00:29 -0800, WUV999U
<us**************@gmail.com> wrote:
Hi

I am fairly familiar with C, but not an expert.

I want to know how I can write an HTML parser in C that only
looks for the image files referenced in an HTML file and displays
or prints all the images found in it.

How should I go about it?

Should I open the file with a file pointer, store the HTML file
in an array first, and then look for the img src, e.g. with some
string comparison?

Is there a sample on the net (nothing fancy, just a simple one)
that I can look at to give me an idea of what I need to do?

Thanks again


If you use linux/unix, something like this could work:

----------

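#!/bin/sh

# sh assigns variables without a leading '$'
tmp="$HOME/images.html"

# start a fresh output file
echo "<HTML><HEAD><TITLE>Images</TITLE></HEAD><BODY>" > "$tmp"

# single quotes keep the shell away from the double quotes and '>' in the pattern;
# this prints at most one <img src="..."> tag per input line
wget -O - http://www.foo.com/whatever.foo |
  sed -n 's/\(.*\)\(<[iI][mM][gG] [sS][rR][cC]="[^>]*">\)\(.*\)/<P>\2/p' >> "$tmp"

echo "</BODY></HTML>" >> "$tmp"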

--------
AC

Nov 14 '05 #12
In article <gV********************@news2.e.nsc.no>,
Daniel Bruce <ir*****@gmail.com> wrote:
:Walter Roberson wrote:
:> State 5: you have recognized up to "<img h". recognize and accept
:> characters that match "ref=\"" and then enter url acceptance mode;

:I was under the impression that image URLs were stored in the src attribute, and
:not the href one. :) Easy to switch anyways.

You are right, I was thinking of anchors when I wrote that.
--
WW{Backus,Church,Dijkstra,Knuth,Hollerith,Turing,vonNeumann}D ?
Nov 14 '05 #13
/**************HTML PARSER*************/

void htmlparse(FILE *);

int main(int argc, char * argv[])
{

    FILE * op;
    op = fopen(argv[1],"r");
    if (op == NULL)
    {
        printf("Error opening file\n");
        exit(0);
    }

    htmlparse(op);
    return 1;

}

void htmlparse(FILE * op)
{
    char line[81];
    char images[250];
    if (fgets(line,81, op) == NULL)
    {
        printf("Error reading data");
        exit(0);
    }

    puts(line);

    if(line == "<img src")
    {

------------------
well,, thats all i hav......... and m stuck here...

Nov 14 '05 #14
In article <11************************@z14g2000cwz.googlegroups.com>,
WUV999U <us**************@gmail.com> wrote:
: op = fopen(argv[1],"r");

argv[1] might be NULL. You should be checking that you have the right
number of parameters before you use any of them.
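
A minimal sketch of that check, assuming a hypothetical helper named open_input (not part of the posted code) that main would call before doing anything else:

```c
#include <stdio.h>

/* Hypothetical helper: returns an open stream, or NULL after printing
   a diagnostic to stderr. argv[1] is only touched once argc says it exists. */
FILE *open_input(int argc, char *argv[])
{
    FILE *fp;

    if (argc != 2) {
        fprintf(stderr, "usage: %s file.html\n",
                (argc > 0 && argv[0] != NULL) ? argv[0] : "htmlparse");
        return NULL;
    }

    fp = fopen(argv[1], "r");
    if (fp == NULL)
        perror(argv[1]);   /* reports why the open failed */
    return fp;
}
```

When this returns NULL, main would normally return EXIT_FAILURE (from stdlib.h); the exit(0) and "return 1" in the posted code report success on failure and failure on success, respectively.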

:void htmlparse(FILE * op)
:{
: char line[81];

Are the lines truly limited to 80 characters of text? It is not
at all uncommon to encounter HTML in which the lines go on for
several hundred characters.

: char images[250];

That declares a single character array named 'images' with a maximum
null-terminated character string size of 249 characters. However,
since you are only fetching 80 characters per line, the maximum
image file name you are going to be able to extract is about 68
characters (once you remove the tag and quotes.)

If you want to allow for 250 images, then you should be declaring
either an array of char * pointers or else a "two dimensional"
array of characters.
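
As a sketch of the first option, an array of char * pointers could look something like this. The struct and function names are made up for illustration; the "two dimensional" alternative would simply be a declaration such as char images[250][256].

```c
#include <stdlib.h>
#include <string.h>

#define MAX_IMAGES 250   /* how many image names we are prepared to keep */

struct image_list {
    char  *names[MAX_IMAGES];  /* each entry points to its own malloc'd copy */
    size_t count;              /* number of entries in use */
};

/* Copies name into the list.
   Returns 1 on success, 0 if the list is full or memory ran out. */
int image_list_add(struct image_list *list, const char *name)
{
    char *copy;

    if (list->count >= MAX_IMAGES)
        return 0;
    copy = malloc(strlen(name) + 1);
    if (copy == NULL)
        return 0;
    strcpy(copy, name);
    list->names[list->count++] = copy;
    return 1;
}

/* Releases every stored name and resets the list to empty. */
void image_list_free(struct image_list *list)
{
    size_t i;

    for (i = 0; i < list->count; i++)
        free(list->names[i]);
    list->count = 0;
}
```

With pointers, each name takes only as much memory as it needs, at the cost of having to free the entries afterwards.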

: if (fgets(line,81, op) == NULL)

There's that magic number again, 81. Any time you have a number whose
meaning is not obvious and which is repeated, you should either
use a #define or store the value in a variable [which would have
implications on how you would write the code.]
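
One way the #define variant might look, as a sketch only: LINE_SIZE and read_chunk are names invented here, and the helper also shows how to notice that a line did not fit in the buffer.

```c
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 81   /* 80 characters plus the terminating '\0' */

/* Hypothetical helper: reads one chunk into line, which must hold
   LINE_SIZE bytes. Returns 0 when fgets returns NULL (end of input or
   a read error), 1 if a complete line (or the final unterminated line)
   was read, 2 if the buffer filled up before a newline was seen. */
int read_chunk(FILE *fp, char line[LINE_SIZE])
{
    if (fgets(line, LINE_SIZE, fp) == NULL)
        return 0;
    return (strchr(line, '\n') != NULL || feof(fp)) ? 1 : 2;
}
```

The size now appears in exactly one place, so raising the limit later means changing a single line.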

: {
: printf("Error reading data");
: exit(0);
: }

Eventually you are going to run out of input and get NULL returned.
That isn't an error: it is a signal that your function should
finish up and return. As you have named the function 'htmlparse',
the reader would tend to assume that -all- the function does is
parse the input and extract certain information from it, but would
not act upon that information, so the reader would tend to assume
that you would return the list of images to the calling routine
and let it do whatever should be done with the list.
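
A rough sketch of that shape: loop until fgets returns NULL, treat that as the normal end of input, and hand a result back to the caller instead of exiting. The function name and the choice to return a simple count are assumptions made for this example; a fuller version would return the list of image names.

```c
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 1024   /* assumed buffer size, larger than the posted 81 */

/* Sketch: counts the chunks read by fgets that contain "<img"
   (case-sensitive) and returns that count, or -1 if the stream
   reported a read error. */
long count_img_lines(FILE *fp)
{
    char line[LINE_SIZE];
    long hits = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, "<img") != NULL)
            hits++;
    }

    /* fgets returned NULL: that is ordinary end of input unless
       ferror() says the stream actually failed. */
    return ferror(fp) ? -1 : hits;
}
```

The caller decides what to print and what exit status to use; the parsing function itself never calls exit().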

: puts(line);

Why do you need to output the line at that point? The input file
isn't going anywhere, so you are unlikely to need to duplicate the
input.

: if(line == "<img src")

That is never going to be true. That is going to compare the
*address* of the string "<img src" to the address of the character
array 'line'. Since "<img src" is a literal string, it is not going
to have the same address as your buffer.

You also cannot fix this just by using strcmp() instead of testing
the pointer: you need to be looking inside the line to find a place
on the line (not necessarily at the beginning) where the string
"<img src" occurs. Try strstr(). But watch out for comments and
for the possibility that you might be within a quoted string...
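
A minimal sketch of the strstr() approach, with the caveats just mentioned left unsolved: extract_src is a hypothetical name, the match is case-sensitive, and it assumes the exact spelling <img src="..."> with a single space and double quotes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: finds the first <img src="..."> anywhere in line
   and copies the quoted value into out (outsize bytes available).
   Returns 1 on success, 0 if nothing was found or the value would not fit. */
int extract_src(const char *line, char *out, size_t outsize)
{
    static const char key[] = "<img src=\"";
    const char *start = strstr(line, key);   /* searches inside the line */
    const char *end;
    size_t len;

    if (start == NULL)
        return 0;
    start += sizeof key - 1;                 /* skip past the opening quote */
    end = strchr(start, '"');                /* closing quote */
    if (end == NULL)
        return 0;
    len = (size_t)(end - start);
    if (len >= outsize)
        return 0;
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Unlike line == "<img src", which compares two addresses, strstr() compares characters and returns a pointer to where the match begins.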

Note too that in the general case it is perfectly acceptable in HTML
for there to be a linebreak between the "<img" and "src". Are you
working with a very restricted subset of HTML? If so then it would
help a lot to describe what the subset is. Some HTML subsets are
very easy to parse, whereas HTML in general is fairly complex to
parse.
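
One common way to stop caring about where the line breaks fall is to read the whole file into memory first and search that single buffer. A sketch, assuming a made-up helper called slurp and an arbitrary starting capacity:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything remaining on fp into a malloc'd,
   NUL-terminated buffer. The caller frees it. Returns NULL on a read
   error or if memory runs out. If out_len is not NULL it receives the
   number of bytes read. */
char *slurp(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, len = 0, n;
    char *buf = malloc(cap);
    char *tmp;

    if (buf == NULL)
        return NULL;

    /* always leave one byte spare for the terminating '\0' */
    while ((n = fread(buf + len, 1, cap - len - 1, fp)) > 0) {
        len += n;
        if (cap - len - 1 == 0) {            /* buffer full: double it */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    buf[len] = '\0';
    if (out_len != NULL)
        *out_len = len;
    return buf;
}
```

The resulting string can then be scanned from start to end, so a tag split across lines is no different from one written on a single line.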

:well,, thats all i hav......... and m stuck here...

Ekkk!

No offense intended but you really haven't gotten very far
at all and have made a number of mistakes in what you posted.
Looking at this, we would tend to conclude that you are very
much a beginner at C (and possibly a beginner at programming
in general). Parsing general HTML is something that requires a
fair bit of experience to program correctly; if what you posted
is indeed representative of your C skills then you have no hope of
writing a generalized HTML img file name extractor in any reasonable
amount of time. Even a well-experienced programmer would take more
than "a day or two" to write a proper HTML parser from scratch.

[Of course, a well-experienced programmer would know to *not*
write it from scratch if it could be avoided: there are a number
of already-written HTML parser libraries out there, and there
are programs such as "lynx" which could be cannibalized. Writing
from scratch would usually be reserved for instances in which
there were notable copyright or patent issues at stake.]
--
IEA408I: GETMAIN cannot provide buffer for WATLIB.
Nov 14 '05 #15
