XML file parsing in C

Marc Dubois

Hi,
Is it possible to parse an XML file in C so that I can fulfill these
requirements:
1) Replace all "<" and ">" signs inside the body of a tag by a space, e.g.:
Example 1:
<foo>blabla < bla </foo>

becomes

<foo>blabla   bla </foo>

Example 2:

<foo>blabla > bla </foo>

becomes
<foo>blabla   bla </foo>
2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters (Unicode or hexadecimal characters) by a
space
I mean, the XML file is not well formed if there are stray "<" and ">" signs
scattered through it. It is not a valid file in that case, so I do not think
the use of a parser would be appropriate. (How would the parser react when it
encounters a "<" that does not correspond to the beginning of a tag?)

Do you have an idea of how I can write a program to deal with these
requirements?
The technical environment is: Unix, KSH, and C (gcc).

I am thinking of using the "sed" command instead. I can get rid of the extra
spaces and replace the special characters, but I still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.

Dec 12 '06 #1


Clever Monkey

Marc Dubois wrote:

> hi,
> is it possible to parse an XML file in C so that i can fulfill these
> requirements :
> 1) replace all "<" and ">" signs inside the body of tag by a space, e.g. :
>
> [...]
>
> 2) Remove all extra spaces at the end of every line of the XML file
> 3) Replace all special characters ( Unicode or Hexadecimal characters) by a
> space
>
> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,
> it is not a valid file in that case, so i do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag ???)
>
> Do you have an idea on how i can write a program to deal with these
> requirements ?
> Technical environment is : Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead, i can get rid of the extra
> spaces and replace the special characters but i still do not know how to
> deal with the extra ">" and "<" signs.

Pretty much OT for this group. Try a newsgroup that deals with POSIX
tools, or try "man sed".

<state:off-topic>
Also, look at XML tidy/validation tools. HTML tidy has limited XML support.
</>

Dec 12 '06 #2

james of tucson

Marc Dubois wrote:

> hi,
> is it possible to parse an XML file in C

Of course it is "possible." Is it easy?
Depends on your experience writing parsers.
The XML grammar is not especially complicated
-- that's sort of the point of it.

If you are willing to take a canned solution, there is expat for C
http://www.jclark.com/xml/expat.html

However, your problem seems to be formatting and error correction, not
XML parsing. For example,

> <foo>blabla < bla </foo>

Is not XML.

> <foo>blabla > bla </foo>

Is not XML.

> 2) Remove all extra spaces at the end of every line of the XML file

You don't need anything but an address of a char array and '\0' to do
that :-)
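
A minimal sketch of that idea, assuming each line has already been read into a writable char array (for example with fgets). The name rtrim_line is hypothetical, chosen only for illustration; the function walks back from the end of the string and writes the '\0' earlier, keeping a trailing newline if there is one.

```c
#include <stddef.h>
#include <string.h>

/* Strip trailing spaces and tabs from a NUL-terminated line, in place.
   A trailing '\n', if present, is kept at the end of the line.
   rtrim_line is a hypothetical helper name, not a library function. */
static void rtrim_line(char *line)
{
    size_t len = strlen(line);
    int had_newline = 0;

    if (len > 0 && line[len - 1] == '\n') {
        had_newline = 1;
        len--;                      /* look at the text before the newline */
    }
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
        len--;                      /* back up over trailing blanks */
    if (had_newline)
        line[len++] = '\n';         /* put the newline back */
    line[len] = '\0';               /* terminate the string earlier */
}
```

Calling rtrim_line on each line returned by fgets and then writing it out with fputs would cover requirement 2. Lines ending in "\r\n" are not treated specially here.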

> 3) Replace all special characters ( Unicode or Hexadecimal characters) by a
> space

This part might be an interesting problem.
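
What counts as a "special character" is not spelled out in the question, so the following is only a sketch under one assumption: that it means any byte outside printable ASCII, other than tab and newline. The name blank_special_bytes is hypothetical. If the requirement is really about character references such as "&#x41;", this does not cover it.

```c
#include <stddef.h>

/* Replace, in place, every byte that is not printable ASCII (0x20..0x7E),
   a tab, or a newline by a space.
   Assumption: "special character" means exactly those bytes.
   blank_special_bytes is a hypothetical helper name. */
static void blank_special_bytes(char *s)
{
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        unsigned char c = (unsigned char)s[i];

        if ((c < 0x20 || c > 0x7E) && c != '\t' && c != '\n')
            s[i] = ' ';
    }
}
```

Because this works byte by byte, a multi-byte UTF-8 character turns into several spaces rather than one.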

> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,

Right, so you realize this, and you realize that an XML parser will
simply choke on it and (maybe) tell you where the errors are :-)

> it is not a valid file in that case, so i do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag ???)

Hopefully, it will emit a diagnostic message ...

> Do you have an idea on how i can write a program to deal with these
> requirements ?
> Technical environment is : Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead, i can get rid of the extra
> spaces and replace the special characters but i still do not know how to
> deal with the extra ">" and "<" signs.

You could use a lookahead technique since you always know what you want
to match. The naive approach I'd start with, would be to work the
tokens from the outer extremes to inner, maybe making a pass first just
to validate that the angle brackets all match up.
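
One possible reading of the lookahead idea, as a rough sketch rather than a tested solution: when a "<" is seen, look ahead to check whether it opens something shaped like a tag (an optional "/", "?" or "!", then a letter or underscore, then a ">" before any other "<"). If it does, skip the tag; if not, blank the "<". Any ">" met outside a tag is blanked too. The names tag_end and blank_stray_brackets are hypothetical, and comments, CDATA sections and ">" inside attribute values are deliberately not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Look ahead from s[i] == '<'. Return the index of the '>' that closes
   the tag, or 0 if the text at s[i] does not look like a tag.
   (0 is safe as "no tag" because a closing '>' is always at index >= 2.) */
static size_t tag_end(const char *s, size_t i)
{
    size_t j = i + 1;

    if (s[j] == '/' || s[j] == '?' || s[j] == '!')
        j++;                                /* </foo>, <?xml ...?>, <!DOCTYPE ...> */
    if (!isalpha((unsigned char)s[j]) && s[j] != '_')
        return 0;                           /* no name follows: not a tag */
    while (s[j] != '\0' && s[j] != '<' && s[j] != '>')
        j++;
    return s[j] == '>' ? j : 0;             /* must close before the next '<' */
}

/* Replace, in place, every '<' or '>' that is not part of a tag by a space. */
static void blank_stray_brackets(char *s)
{
    size_t i = 0;

    while (s[i] != '\0') {
        if (s[i] == '<') {
            size_t end = tag_end(s, i);

            if (end != 0) {
                i = end + 1;                /* skip the whole tag untouched */
                continue;
            }
            s[i] = ' ';                     /* stray '<' */
        } else if (s[i] == '>') {
            s[i] = ' ';                     /* stray '>' outside any tag */
        }
        i++;
    }
}
```

This assumes the whole document, or at least each complete tag, is in one buffer; a tag split across two fgets lines would be misread as stray brackets.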

> Thanks for your help.

I replied to your post because I work in a Java environment, and I
realize I am spoiled. Doing XML in Java is too simple to warrant much
discussion. Doing an XML parser in C, on the other hand, from scratch,
would be a very interesting problem.

After considering it for about half a second, I'd look into the
difficulty level of using the Xerces-C++ library in a C app. Or the
XML::Parser Perl module.

I realize you want to feed it invalid XML and correct errors; I know
from experience that you can use Xerces to a certain extent to locate
errors, so it might not be terribly hard to take that approach: make
passes through the Xerces validator to find errors, fix them, and end up
with the ability to do SAX or DOM on the document for free.

I have never, ever, even considered touching Xerces-C++, so I don't know
if it has anything in common with Xerces-Java. The docs on the Xerces
site make it look easy enough to use.

Somebody out there has done this, right?

Dec 12 '06 #3

Rob Hoelz

Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.

--
-Rob Hoelz

On Tue, 12 Dec 2006 22:07:14 +0100 "Marc Dubois" <no@spam.com>
wrote:

> [...]

Dec 12 '06 #4

Marc Dubois

I don't know Perl.

"Rob Hoelz" <ho***@wisc.edu> wrote in message
news:20061212172417.133beba0@TheRing...

> Just curious, why do you want to use C for this? I'm not bashing C,
> (I love it), but this seems like the kind of task Perl was created
> for.
>
> --
> -Rob Hoelz
>
> On Tue, 12 Dec 2006 22:07:14 +0100 "Marc Dubois" <no@spam.com>
> wrote:
>
>> [...]

Dec 12 '06 #5

Rob Hoelz

It's a good language; I'd consider learning it if I were you.

"Marc Dubois" <no@spam.com> wrote:

> i dont know PErl
> "Rob Hoelz" <ho***@wisc.edu> wrote in message
> news:20061212172417.133beba0@TheRing...
>> Just curious, why do you want to use C for this? I'm not bashing C,
>> (I love it), but this seems like the kind of task Perl was created
>> for.
>>
>> --
>> -Rob Hoelz
>>
>> On Tue, 12 Dec 2006 22:07:14 +0100 "Marc Dubois" <no@spam.com>
>> wrote:
>>
>>> [...]

--
-Rob Hoelz

Dec 12 '06 #6

Default User

Rob Hoelz wrote:

Just curious, why do you want to use C for this?

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or:
<http://www.caliburn.nl/topposting.html>

Dec 12 '06 #7

Richard

"Default User" <de***********@yahoo.com> writes:

> Rob Hoelz wrote:
>
>> Just curious, why do you want to use C for this?
>
> Please don't top-post. Your replies belong following or interspersed
> with properly trimmed quotes. See the majority of other posts in the
> newsgroup, or:
> <http://www.caliburn.nl/topposting.html>

Lecturing on top posting is OT.

Dec 13 '06 #8

John F

Richard wrote:

> "Default User" <de***********@yahoo.com> writes:
>
>> Rob Hoelz wrote:
>>
>>> Just curious, why do you want to use C for this?
>>
>> Please don't top-post. Your replies belong following or interspersed
>> with properly trimmed quotes. See the majority of other posts in the
>> newsgroup, or:
>> <http://www.caliburn.nl/topposting.html>
>
> Lecturing on top posting is OT.

It's somehow ironic but: so is lecturing on OT :-)

--
Johannes
You can have it:
Quick, Accurate, Inexpensive.
Pick two.

Dec 13 '06 #9

Keith Thompson

"John F" <sp**@127.0.0.1> writes:

> Richard wrote:
>> "Default User" <de***********@yahoo.com> writes:
>>
>>> Rob Hoelz wrote:
>>>
>>>> Just curious, why do you want to use C for this?
>>>
>>> Please don't top-post. Your replies belong following or interspersed
>>> with properly trimmed quotes. See the majority of other posts in the
>>> newsgroup, or:
>>> <http://www.caliburn.nl/topposting.html>
>>
>> Lecturing on top posting is OT.
>
> It's somehow ironic but: so is lecturing on OT :-)

By convention, meta-discussions about topicality are considered
topical.

In my opinion, discussions about how to post properly should also be
considered topical. If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool. The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.

Dec 13 '06 #10
