xml file parsing in C

Hi,
is it possible to parse an XML file in C so that I can fulfill these
requirements:
1) Replace all "<" and ">" signs inside the body of a tag by a space, e.g.:
Example 1:
<foo>blabla < bla </foo>

becomes

<foo>blabla bla </foo>

Example 2:

<foo>blablabl a </foo>

becomes
<foo>blablabla </foo>
2) Remove all extra spaces at the end of every line of the XML file
3) Replace all special characters (Unicode or hexadecimal characters) by a
space
I mean the XML file is not well formed if there are "<" and ">" signs a
little bit everywhere,
it is not a valid file in that case, so I do not think the use of a parser
would be appropriate in that case. (How would the parser react when it
encounters a < that does not correspond to the beginning of a tag?)

Do you have an idea on how I can write a program to deal with these
requirements?
Technical environment is: Unix, KSH, and C (gcc)

I am thinking of using the "sed" command instead; I can get rid of the extra
spaces and replace the special characters, but I still do not know how to
deal with the extra ">" and "<" signs.

Thanks for your help.
Dec 12 '06 #1
Marc Dubois wrote:
> Hi,
> is it possible to parse an XML file in C so that I can fulfill these
> requirements:
> 1) Replace all "<" and ">" signs inside the body of a tag by a space, e.g.:
> [...]
> 2) Remove all extra spaces at the end of every line of the XML file
> 3) Replace all special characters (Unicode or hexadecimal characters) by a
> space
>
> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,
> it is not a valid file in that case, so I do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag?)
>
> Do you have an idea on how I can write a program to deal with these
> requirements?
> Technical environment is: Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead; I can get rid of the extra
> spaces and replace the special characters, but I still do not know how to
> deal with the extra ">" and "<" signs.

Pretty much OT for this group. Try a newsgroup that deals with POSIX
tools, or try "man sed".

<state:off-topic>
Also, look at XML tidy/validation tools. HTML tidy has limited XML support.
</>
Dec 12 '06 #2
Marc Dubois wrote:
> Hi,
> is it possible to parse an XML file in C

Of course it is "possible." Is it easy?
Depends on your experience writing parsers.
The XML grammar is not especially complicated
-- that's sort of the point of it.

If you are willing to take a canned solution, there is expat for C
http://www.jclark.com/xml/expat.html

However, your problem seems to be formatting and error correction, not
XML parsing. For example,

> <foo>blabla < bla </foo>

Is not XML.

> <foo>blablabl a </foo>

Is not XML.

> 2) Remove all extra spaces at the end of every line of the XML file

You don't need anything but an address of a char array and '\0' to do
that :-)
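
[Editor's sketch, not part of the original reply.] To make the "char array and '\0'" remark concrete, here is a minimal sketch in C. The function name `rtrim_line` is hypothetical, and it assumes each line has already been read into a writable buffer (for example with `fgets`) and that "extra spaces" means trailing spaces and tabs.

```c
#include <stddef.h>
#include <string.h>

/* Strip trailing spaces and tabs from one NUL-terminated line, in place.
 * A final '\n', if present, is kept. Returns the new length of the line.
 * (rtrim_line is a hypothetical name chosen for this sketch.) */
size_t rtrim_line(char *line)
{
    size_t len = strlen(line);
    int had_newline = 0;

    /* Set the newline aside so that it survives the trimming. */
    if (len > 0 && line[len - 1] == '\n') {
        had_newline = 1;
        len--;
    }

    /* Walk backwards over trailing blanks. */
    while (len > 0 && (line[len - 1] == ' ' || line[len - 1] == '\t'))
        len--;

    /* Put the newline back and terminate the string at the new end. */
    if (had_newline)
        line[len++] = '\n';
    line[len] = '\0';
    return len;
}
```

Called on each line in turn, this only ever shortens the string, so no extra memory is needed. Lines ending in "\r\n" are not handled here; that case would need one more check.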

> 3) Replace all special characters (Unicode or hexadecimal characters) by a
> space

This part might be an interesting problem.
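
[Editor's sketch, not part of the original reply.] The question never defines "special characters", so the following is only one possible reading: every byte that is neither printable ASCII nor a newline or tab gets replaced by a space. The name `blank_special_bytes` is hypothetical. Under this assumption a multi-byte UTF-8 character turns into several spaces, one per byte, and numeric character references such as `&#x41;` are left alone.

```c
#include <stddef.h>

/* Replace every byte that is not printable ASCII (0x20..0x7E), '\n' or
 * '\t' with a space. Works on a buffer of `len` bytes, in place.
 * Returns the number of bytes that were replaced.
 * ASSUMPTION: this byte-level rule is the editor's guess at what
 * "special characters" means; the original post does not say. */
size_t blank_special_bytes(char *buf, size_t len)
{
    size_t replaced = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)buf[i];

        if (c == '\n' || c == '\t')
            continue;               /* keep line structure and tabs */
        if (c < 0x20 || c > 0x7E) {
            buf[i] = ' ';           /* control or non-ASCII byte */
            replaced++;
        }
    }
    return replaced;
}
```

Because the length is passed explicitly, the function also copes with embedded NUL bytes, which a `strlen`-based loop would not.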

> I mean the XML file is not well formed if there are "<" and ">" signs a
> little bit everywhere,

Right, so you realize this, and you realize that an XML parser will
simply choke on it and (maybe) tell you where the errors are :-)

> it is not a valid file in that case, so I do not think the use of a parser
> would be appropriate in that case. (How would the parser react when it
> encounters a < that does not correspond to the beginning of a tag?)

Hopefully, it will emit a diagnostic message ...

> Do you have an idea on how I can write a program to deal with these
> requirements?
> Technical environment is: Unix, KSH, and C (gcc)
>
> I am thinking of using the "sed" command instead; I can get rid of the extra
> spaces and replace the special characters, but I still do not know how to
> deal with the extra ">" and "<" signs.

You could use a lookahead technique since you always know what you want
to match. The naive approach I'd start with would be to work the
tokens from the outer extremes to inner, maybe making a pass first just
to validate that the angle brackets all match up.
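
[Editor's sketch, not part of the original reply.] One way the lookahead idea could look in C is shown below. It is a simpler single left-to-right pass, not the outer-to-inner strategy described above. The names `tag_end` and `blank_stray_brackets` are hypothetical, and the rule for what counts as a tag is an assumption: a "<" followed by a letter, "_", "/", "?" or "!" with a ">" appearing before the next "<". Attribute values containing angle brackets, comments and CDATA sections are not handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Look ahead from the '<' at buf[start]. Return the index of the '>'
 * that closes a plausible tag, or 0 if the '<' looks stray. Zero is safe
 * as a "not found" value because a closing '>' always sits after start. */
static size_t tag_end(const char *buf, size_t len, size_t start)
{
    size_t i = start + 1;
    unsigned char c;

    if (i >= len)
        return 0;
    c = (unsigned char)buf[i];
    if (!(isalpha(c) || c == '_' || c == '/' || c == '?' || c == '!'))
        return 0;                   /* e.g. "< bla": not a tag start */

    for (; i < len; i++) {
        if (buf[i] == '>')
            return i;               /* tag closes here */
        if (buf[i] == '<')
            return 0;               /* another '<' came first: stray */
    }
    return 0;                       /* ran off the end without '>' */
}

/* Replace with a space every '<' that does not start a plausible tag and
 * every '>' found outside a tag. Returns the number of replacements. */
size_t blank_stray_brackets(char *buf, size_t len)
{
    size_t replaced = 0;
    size_t i = 0;

    while (i < len) {
        if (buf[i] == '<') {
            size_t end = tag_end(buf, len, i);
            if (end != 0) {
                i = end + 1;        /* skip the whole tag untouched */
                continue;
            }
            buf[i] = ' ';
            replaced++;
        } else if (buf[i] == '>') {
            buf[i] = ' ';           /* '>' outside any tag */
            replaced++;
        }
        i++;
    }
    return replaced;
}
```

Since each stray bracket is overwritten rather than deleted, the output has the same length as the input, so the whole file can be processed in one buffer. Whether the heuristic is acceptable depends on the real data: text such as "a <b c> d" would be kept as a tag.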

>
> Thanks for your help.

I replied to your post because I work in a Java environment, and I
realize I am spoiled. Doing XML in Java is too simple to warrant much
discussion. Doing an XML parser in C, on the other hand, from scratch,
would be a very interesting problem.

After considering it for about half a second, I'd look into the
difficulty level of using the Xerces-C++ library in a C app. Or the
XML::Parser Perl module.

I realize you want to feed it invalid XML and correct errors; I know
from experience that you can use Xerces to a certain extent to locate
errors, so it might not be terribly hard to take that approach - make
passes through the Xerces validator to find errors, fix them, and end up
with the ability to do SAX or DOM on the document for free.

I have never, ever, even considered touching Xerces-C++, so I don't know
if it has anything in common with Xerces-Java. The docs on the Xerces
site make it look easy enough to use.

Somebody out there has done this, right?
Dec 12 '06 #3
Just curious, why do you want to use C for this? I'm not bashing C,
(I love it), but this seems like the kind of task Perl was created
for.

--
-Rob Hoelz

Dec 12 '06 #4
I don't know Perl.

"Rob Hoelz" <ho***@wisc.edu> wrote in message
news:20061212172417.133beba0@TheRing...
> Just curious, why do you want to use C for this? I'm not bashing C,
> (I love it), but this seems like the kind of task Perl was created
> for.
Dec 12 '06 #5
It's a good language; I'd consider learning it if I were you.

"Marc Dubois" <no@spam.com> wrote:
> I don't know Perl.
>
> "Rob Hoelz" <ho***@wisc.edu> wrote in message
> news:20061212172417.133beba0@TheRing...
>> Just curious, why do you want to use C for this? I'm not bashing C,
>> (I love it), but this seems like the kind of task Perl was created
>> for.

--
-Rob Hoelz
Dec 12 '06 #6
Rob Hoelz wrote:
> Just curious, why do you want to use C for this?

Please don't top-post. Your replies belong following or interspersed
with properly trimmed quotes. See the majority of other posts in the
newsgroup, or:
<http://www.caliburn.nl/topposting.html>
Dec 12 '06 #7
"Default User" <de***********@yahoo.com> writes:
> Rob Hoelz wrote:
>> Just curious, why do you want to use C for this?
>
> Please don't top-post. Your replies belong following or interspersed
> with properly trimmed quotes. See the majority of other posts in the
> newsgroup, or:
> <http://www.caliburn.nl/topposting.html>

Lecturing on top posting is OT.
Dec 13 '06 #8
Richard wrote:
> "Default User" <de***********@yahoo.com> writes:
>> Rob Hoelz wrote:
>>> Just curious, why do you want to use C for this?
>>
>> Please don't top-post. Your replies belong following or interspersed
>> with properly trimmed quotes. See the majority of other posts in the
>> newsgroup, or:
>> <http://www.caliburn.nl/topposting.html>
>
> Lecturing on top posting is OT.

It's somehow ironic but: so is lecturing on OT :-)

--
Johannes
You can have it:
Quick, Accurate, Inexpensive.
Pick two.
Dec 13 '06 #9
"John F" <sp**@127.0.0.1> writes:
> Richard wrote:
>> "Default User" <de***********@yahoo.com> writes:
>>> Rob Hoelz wrote:
>>>> Just curious, why do you want to use C for this?
>>>
>>> Please don't top-post. Your replies belong following or interspersed
>>> with properly trimmed quotes. See the majority of other posts in the
>>> newsgroup, or:
>>> <http://www.caliburn.nl/topposting.html>
>>
>> Lecturing on top posting is OT.
>
> It's somehow ironic but: so is lecturing on OT :-)

By convention, meta-discussions about topicality are considered
topical.

In my opinion, discussions about how to post properly should also be
considered topical. If nobody ever complained about top-posting, we'd
end up with an ugly mixture of top-posting, bottom-posting,
mid-posting, and whatever other forms of posting some random person
decides Looks Really Cool. The newsgroup will become more difficult
to read, and those who spend the most time here will lose patience and
give up on the newsgroup. Since spending a lot of time here
correlates fairly strongly (but not perfectly) with expertise, I
suggest that this would be to the great detriment of the newsgroup.

Personally, I *usually* don't complain about top-posting unless I
happen to be replying to the article anyway.

Perhaps we should agree on a de facto standard tag, like "[TP]", for
articles that complain about top-posting without adding new content.
Or perhaps there should be a more generic tag for criticisms of
posting style. (In my opinion, articles that complain about posting
style *and* discuss C need no such tag.)

--
Keith Thompson (The_Other_Keith) ks***@mib.org <http://www.ghoti.net/~kst>
San Diego Supercomputer Center <*> <http://users.sdsc.edu/~kst>
We must do something. This is something. Therefore, we must do this.
Dec 13 '06 #10
