can I know how to write a html parser in C - Page 2

WUV999U

Hi

I am fairly familiar in C but not much.

I want to know how I can write a html parser in C that only parses for
the image file in the html file and display or print
all the images found in the html file.

How to go about it?

Should I have a file pointer and store the html file into an array
first and then look for the img src..
like do some string compare...

Is there a sample on the net(not a hifi code,, a simple one) that I can
look at to give me an idea on what I need to do.

Thanks again

Nov 14 '05


Daniel Bruce

Walter Roberson wrote:

State 2: recognize and discard whitespace (including newline).
When you get the first non-whitespace character, then if you had
no whitespace or if tolower(character) is not 'h' then transit to state 4
else transit to state 5

<snip>

State 5: you have recognized up to "<img h". recognize and accept
characters that match "ref=\"" and then enter url acceptance mode;
if you hit something else, go to state 4

<snip>

Just a slight nitpick on a seemingly good text (I have no idea about the subject
myself, so I can't really say anything about the quality of the text :)
I was under the impression that image URLs are stored in the src attribute,
not the href one. :) Easy to switch anyway.
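
For readers following the state-machine idea quoted above, here is a minimal sketch that reads one character at a time and matches src rather than href. The function name scan_images and its four states are assumptions made for illustration; they are not the numbered states from Walter Roberson's post. It only handles double-quoted values written as src="..." and does not know about comments or a '>' inside another attribute's value.

```c
#include <ctype.h>
#include <stdio.h>

enum state { TEXT, TAG_NAME, ATTRS, URL };

/* Hypothetical scanner: reads fp one character at a time and writes
   each img src URL to out, one per line. Returns the number written. */
int scan_images(FILE *fp, FILE *out)
{
    static const char name[] = "img";
    static const char attr[] = "src=\"";
    enum state st = TEXT;
    size_t i = 0;          /* characters of name or attr matched so far */
    int prev_space = 0;    /* was the previous character whitespace? */
    int c, count = 0;

    while ((c = getc(fp)) != EOF) {
        switch (st) {
        case TEXT:                       /* outside any tag */
            if (c == '<') { st = TAG_NAME; i = 0; }
            break;
        case TAG_NAME:                   /* matching "img" after '<' */
            if (name[i] != '\0' && tolower(c) == name[i]) {
                i++;
            } else if (name[i] == '\0' && isspace(c)) {
                st = ATTRS; i = 0; prev_space = 1;
            } else {
                st = (c == '<') ? TAG_NAME : TEXT;
                i = 0;
            }
            break;
        case ATTRS:                      /* inside <img ...>, looking for src=" */
            if (c == '>') {
                st = TEXT;
            } else if (attr[i] != '\0' && (i > 0 || prev_space)
                       && tolower(c) == attr[i]) {
                if (attr[++i] == '\0')
                    st = URL;
            } else {
                i = 0;
            }
            prev_space = isspace(c);
            break;
        case URL:                        /* copying the URL up to the closing quote */
            if (c == '"') {
                putc('\n', out);
                count++;
                st = ATTRS; i = 0; prev_space = 0;
            } else {
                putc(c, out);
            }
            break;
        }
    }
    return count;
}
```

Calling scan_images(fp, stdout) on an open HTML file would print the URLs it finds. Because it never buffers a line, line length and line breaks inside the tag do not matter.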

Nov 14 '05 #11

Alan Connor

On 23 Feb 2005 13:00:29 -0800, WUV999U
<us**************@gmail.com> wrote:

Hi

I am fairly familiar in C but not much.

I want to know how I can write a html parser in C that only
parses for the image file in the html file and display or print
all the images found in the html file.

How to go about it?

Should I have a file pointer and store the html file into an
array first and then look for the img src.. like do some string
compare...

Is there a sample on the net(not a hifi code,, a simple one)
that I can look at to give me an idea on what I need to do.

Thanks again

If you use linux/unix, something like this could work:

----------

#!/bin/sh

# Output page that will list the images
tmp=$HOME/images.html

echo "<HTML><HEAD><TITLE>Images</TITLE></HEAD><BODY>" >> "$tmp"

# Keep the last <img src="..."> tag on each line, one per paragraph
wget -O - http://www.foo.com/whatever.foo |
sed -n 's/\(.*\)\(<[iI][mM][gG] [sS][rR][cC]="[^>]*">\)\(.*\)/<P>\2/p' >> "$tmp"

echo "</BODY></HTML>" >> "$tmp"

--------
AC

Nov 14 '05 #12

Walter Roberson

In article <gV********************@news2.e.nsc.no>,
Daniel Bruce <ir*****@gmail.com> wrote:
:Walter Roberson wrote:
:> State 5: you have recognized up to "<img h". recognize and accept
:> characters that match "ref=\"" and then enter url acceptance mode;

:I was under the impression that image URLs were stored in the src attribute, and
:not the href one. :) Easy to switch anyways.

You are right, I was thinking of anchors when I wrote that.
--
WW{Backus,Church,Dijkstra,Knuth,Hollerith,Turing,vonNeumann}D ?

Nov 14 '05 #13

WUV999U

/**************HTML PARSER*************/

void htmlparse(FILE *);

int main(int argc, char * argv[])
{

FILE * op;
op = fopen(argv[1],"r");
if (op == NULL)
{
printf("Error opening file\n");
exit(0);
}

htmlparse(op);
return 1;

}

void htmlparse(FILE * op)
{
char line[81];
char images[250];
if (fgets(line,81, op) == NULL)
{
printf("Error reading data");
exit(0);
}

puts(line);

if(line == "<img src")
{

------------------
well,, thats all i hav......... and m stuck here...

Nov 14 '05 #14

Walter Roberson

In article <11**********************@z14g2000cwz.googlegroups.com>,
WUV999U <us**************@gmail.com> wrote:
: op = fopen(argv[1],"r");

argv[1] might be NULL. You should be checking that you have the right
number of parameters before you use any of them.
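
A minimal sketch of that check, assuming a hypothetical helper named open_input (it is not part of the posted program). It refuses to touch argv[1] unless exactly one file name was given, and it reports failures on stderr:

```c
#include <stdio.h>

/* Hypothetical helper: returns an open stream, or NULL after printing
   a diagnostic to stderr. */
FILE *open_input(int argc, char *argv[])
{
    FILE *fp;

    if (argc != 2) {
        /* argv[0] itself may be missing, so fall back to a fixed name */
        fprintf(stderr, "usage: %s file.html\n",
                (argc > 0 && argv[0] != NULL) ? argv[0] : "htmlparse");
        return NULL;
    }
    fp = fopen(argv[1], "r");
    if (fp == NULL)
        perror(argv[1]);   /* says why the open failed */
    return fp;
}
```

In main() you would call open_input(argc, argv) and return EXIT_FAILURE (from <stdlib.h>) when it gives back NULL, rather than calling exit(0), which signals success.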

:void htmlparse(FILE * op)
:{
: char line[81];

Are the lines truly limited to 80 characters of text? It is not
at all uncommon to encounter HTML in which the lines go on for
several hundred characters.

: char images[250];

That declares a single character array named 'images' with a maximum
null-terminated character string size of 249 characters. However,
since you are only fetching 80 characters per line, the maximum
image file name you are going to be able to extract is about 68
characters (once you remove the tag and quotes.)

If you want to allow for 250 images, then you should be declaring
either an array of char * pointers or else a "two dimensional"
array of characters.
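
The two options might look like this. The limits MAX_IMAGES and MAX_URL and the helper add_image are assumptions for illustration only; the 250 is simply borrowed from the posted declaration.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_IMAGES 250   /* assumed limit on the number of images */
#define MAX_URL    256   /* assumed maximum URL length, including '\0' */

/* Option 1: "two dimensional" array. Simple, but every slot takes
   MAX_URL bytes whether or not it is used. */
char images_2d[MAX_IMAGES][MAX_URL];

/* Option 2: array of pointers. Each URL is allocated to fit. */
char *images_ptr[MAX_IMAGES];
size_t image_count = 0;

/* Hypothetical helper: stores a copy of url in the pointer array.
   Returns 0 on success, -1 if the table is full or malloc fails. */
int add_image(const char *url)
{
    char *copy;

    if (image_count >= MAX_IMAGES)
        return -1;
    copy = malloc(strlen(url) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, url);
    images_ptr[image_count++] = copy;
    return 0;
}
```

With the pointer version the caller is responsible for calling free() on each stored entry when it is done with the list.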

: if (fgets(line,81, op) == NULL)

There's that magic number again, 81. Any time you have a number whose
meaning is not obvious and which is repeated, you should either
use a #define or store the value in a variable [which would have
implications on how you would write the code.]
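
As a small sketch of the #define approach: the name LINE_SIZE, its value, and the helper read_line are assumptions, not something from the posted code. The point is that the size is written down once and every use refers to the name.

```c
#include <stdio.h>

#define LINE_SIZE 1024   /* assumed buffer size; choose what suits your input */

/* Hypothetical helper: reads one line into a buffer of LINE_SIZE bytes.
   Returns 1 if a line (or part of a long line) was read, 0 at end of
   input or on error. */
int read_line(FILE *fp, char line[LINE_SIZE])
{
    return fgets(line, LINE_SIZE, fp) != NULL;
}
```

When the buffer is a true array in the same scope as the call, fgets(line, sizeof line, fp) achieves the same thing without repeating even the name.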

: {
: printf("Error reading data");
: exit(0);
: }

Eventually you are going to run out of input and get NULL returned.
That isn't an error: it is a signal that your function should
finish up and return. As you have named the function 'htmlparse',
the reader would tend to assume that -all- the function does is
parse the input and extract certain information from it, but would
not act upon that information, so the reader would tend to assume
that you would return the list of images to the calling routine
and let it do whatever should be done with the list.
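
One way to structure that, as a sketch only: loop until fgets() returns NULL, use ferror() to tell a real read error from ordinary end of input, and hand a result back to the caller instead of printing. The function name count_img_lines and the LINE_SIZE value are assumptions. It merely counts lines that mention "<img", and a line longer than the buffer is seen in pieces, so treat the count as illustrative.

```c
#include <stdio.h>
#include <string.h>

#define LINE_SIZE 1024   /* assumed buffer size */

/* Hypothetical parse routine: returns the number of lines that contain
   "<img", or -1 on a genuine read error. It prints nothing itself. */
long count_img_lines(FILE *fp)
{
    char line[LINE_SIZE];
    long count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strstr(line, "<img") != NULL)
            count++;
    }
    /* NULL from fgets means end of input unless ferror() says otherwise. */
    return ferror(fp) ? -1 : count;
}
```

The caller decides what to do with the result, for example printing it or reporting the error.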

: puts(line);

Why do you need to output the line at that point? The input file
isn't going anywhere, so you are unlikely to need to duplicate the
input.

: if(line == "<img src")

That is never going to be true. That is going to compare the
*address* of the string "<img src" to the address of the character
array 'line'. Since "<img src" is a literal string, it is not going
to have the same address as your buffer.
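
To compare the characters rather than the addresses, you would call strcmp(). A tiny sketch, with is_img_src as a made-up name:

```c
#include <string.h>

/* Returns 1 when the contents of s are exactly "<img src", 0 otherwise. */
int is_img_src(const char *s)
{
    /* s == "<img src" would compare two addresses, not the characters. */
    return strcmp(s, "<img src") == 0;
}
```

Note that this is true only when the whole string is "<img src" and nothing more, so a real line such as <img src="a.png"> still does not match.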

You also cannot fix this just by using strcmp() instead of testing
the pointer: you need to be looking inside the line to find a place
on the line (not necessarily at the beginning) where the string
"<img src" occurs. Try strstr(). But watch out for comments and
for the possibility that you might be within a quoted string...
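
A rough sketch of the strstr() approach, under several assumptions: the tag is written exactly as <img src=" in lower case with a single space, the URL is double-quoted, and the whole tag is on one line. The helper name find_img_src is hypothetical, and the sketch makes no attempt to skip comments or quoted strings.

```c
#include <string.h>

/* Hypothetical helper: finds the first <img src="..."> on line and
   copies the URL into out. Returns 1 if one was found and fits in
   outsize bytes, 0 otherwise. */
int find_img_src(const char *line, char *out, size_t outsize)
{
    static const char tag[] = "<img src=\"";
    const char *start, *end;
    size_t len;

    start = strstr(line, tag);          /* search anywhere in the line */
    if (start == NULL)
        return 0;
    start += sizeof tag - 1;            /* step past the tag text */
    end = strchr(start, '"');           /* closing quote of the URL */
    if (end == NULL)
        return 0;
    len = (size_t)(end - start);
    if (len >= outsize)
        return 0;                       /* URL would not fit */
    memcpy(out, start, len);
    out[len] = '\0';
    return 1;
}
```

Given the line <p><img src="pic/a.png" alt="x"></p>, this stores pic/a.png in out.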

Note too that in the general case it is perfectly acceptable in HTML
for there to be a linebreak between the "<img" and "src". Are you
working with a very restricted subset of HTML? If so then it would
help a lot to describe what the subset is. Some HTML subsets are
very easy to parse, whereas HTML in general is fairly complex to
parse.
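
If the input is not that restricted, one option is to read the whole file into memory and scan the buffer, so that line breaks inside a tag stop mattering. The sketch below assumes the HTML is already in a null-terminated string; the names starts_with_ci and next_img_src are hypothetical. It accepts any letter case and any whitespace (including newlines) between attributes, but still assumes src="..." with double quotes and no spaces around the '=', and it can be fooled by a '>' inside another attribute's value or by comments.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if s starts with word, ignoring case (word is lower case). */
static int starts_with_ci(const char *s, const char *word)
{
    while (*word != '\0') {
        if (tolower((unsigned char)*s) != *word)
            return 0;
        s++;
        word++;
    }
    return 1;
}

/* Hypothetical scanner: finds the next <img ... src="URL"> at or after p
   and copies URL into out. Returns a pointer just past the URL's closing
   quote, or NULL when there are no more. */
const char *next_img_src(const char *p, char *out, size_t outsize)
{
    while ((p = strchr(p, '<')) != NULL) {
        p++;
        if (!starts_with_ci(p, "img") || !isspace((unsigned char)p[3]))
            continue;                    /* some other tag */
        p += 3;
        /* Walk the attributes until the tag closes. */
        while (*p != '\0' && *p != '>') {
            if (isspace((unsigned char)p[-1]) && starts_with_ci(p, "src=\"")) {
                const char *start = p + 5;
                const char *end = strchr(start, '"');
                size_t len;

                if (end == NULL)
                    return NULL;         /* unterminated value: give up */
                len = (size_t)(end - start);
                if (len < outsize) {
                    memcpy(out, start, len);
                    out[len] = '\0';
                    return end + 1;
                }
                p = end;                 /* too long for out: skip it */
            }
            p++;
        }
    }
    return NULL;
}
```

To list every image, call it in a loop, passing the returned pointer back in as p until it returns NULL.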

:well,, thats all i hav......... and m stuck here...

Ekkk!

No offense intended but you really haven't gotten very far
at all and have made a number of mistakes in what you posted.
Looking at this, we would tend to conclude that you are very
much a beginner at C (and possibly a beginner at programming
in general). Parsing general HTML is something that requires a
fair bit of experience to program correctly; if what you posted
is indeed representative of your C skills then you have no hope of
writing a generalized HTML img file name extractor in any reasonable
amount of time. Even a well-experienced programmer would take more
than "a day or two" to write a proper HTML parser from scratch.

[Of course, a well-experienced programmer would know to *not*
write it from scratch if it could be avoided: there are a number
of already-written HTML parser libraries out there, and there
are programs such as "lynx" which could be cannibalized. Writing
from scratch would usually be reserved for instances in which
there were notable copyright or patent issues at stake.]
--
IEA408I: GETMAIN cannot provide buffer for WATLIB.

Nov 14 '05 #15
