Parsing complex xml file with C#

Pir8

I have a complex xml file, which contains stories within a magazine. The
structure of the xml file is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<magazine>
  <story>
    <story_id>112233</story_id>
    <pub_name>Puleen's Publication</pub_name>
    <pub_code>PP</pub_code>
    <edition_date>20031201</edition_date>
    <edition_name></edition_name>
    <section_name></section_name>
    <page_id></page_id>
    <headline>My Story Headline</headline>
    <subhead>Sub head</subhead>
    <byline>Puleen</byline>
    <source></source>
    <dateline></dateline>
    <storytype></storytype>
    <column>Search</column>
    <company_list></company_list>
    <keyword_list></keyword_list>
    <text>In other news....second paragraph</text>
    <photo>
      <caption></caption>
      <photo_filename>197943-96068.jpg</photo_filename>
      <photocredit></photocredit>
    </photo>
    <photo>
      <caption></caption>
      <photo_filename>197943-96069.jpg</photo_filename>
      <photocredit></photocredit>
    </photo>
    <photo>
      <caption></caption>
      <photo_filename>197943-96067.jpg</photo_filename>
      <photocredit></photocredit>
    </photo>
  </story>
</magazine>

So there could be multiple <story> elements for each magazine. On the back
end, the data gets stored in an Oracle database. However, the data for the
photos is stored in a separate table from the actual story. What's the
best way to approach parsing the story contents and building a query out of
them, and then parsing the photo contents and building a query out of
those?

Any ideas are welcome. I've been trying to parse the xml file, but I
cannot think of a quick way of doing this. Maybe someone out there can
point me in the right direction and/or suggest a quick solution.

Nov 15 '05 #1

Serialize and deserialize for persistent storage.

Then play with the object :D Much easier.
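
For what it's worth, the serialization route could look roughly like the sketch below, using XmlSerializer from System.Xml.Serialization. The class and member names (Magazine, Story, Photo, MagazineLoader) are made up for illustration, and only a few of the elements from the sample file are mapped; as far as I know, elements without a mapped member are simply skipped by default.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical classes mirroring part of the <magazine> format shown above.
[XmlRoot("magazine")]
public class Magazine
{
    // One entry per <story> element.
    [XmlElement("story")]
    public List<Story> Stories = new List<Story>();
}

public class Story
{
    [XmlElement("story_id")] public string StoryId;
    [XmlElement("headline")] public string Headline;

    // One entry per <photo> child of this story.
    [XmlElement("photo")]
    public List<Photo> Photos = new List<Photo>();
}

public class Photo
{
    [XmlElement("caption")] public string Caption;
    [XmlElement("photo_filename")] public string FileName;
    [XmlElement("photocredit")] public string Credit;
}

public static class MagazineLoader
{
    // Deserializes the whole document into an object graph.
    public static Magazine Load(TextReader xml)
    {
        var serializer = new XmlSerializer(typeof(Magazine));
        return (Magazine)serializer.Deserialize(xml);
    }
}
```

Once loaded, each Story carries its own Photos list, so the story_id is at hand when writing the photo rows. Whether this copes well with markup embedded in <text> is a separate question and is not covered here.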

Nov 15 '05 #2

Daniel O'Connell

"Pir8" <pi**@mscorlib.com> wrote in message
news:%2****************@TK2MSFTNGP10.phx.gbl...

> The main problem that I am concerned with is that within the <text> there
> might be, and will be, html tags, i.e. <a href=""> and so on. I do
> realize that I could use the node's InnerXml property to retrieve this,
> but will there be any other complications in the future?

It *should* work, but it depends on your html. I wouldn't want to throw
non-xhtml html at an xml parser; it's just not particularly safe. You might
want to consider wrapping the body of text in a CDATA section.

> The xml format that I pasted is pretty much the same... There are some
> other tags that I did not include, which also will go into a separate
> table of their own in Oracle.

I asked about the format because, personally, I would have used an id
attribute instead of a <story_id> element.

> My main concern is that, when parsing the <story> and <photo> separately, I
> need to associate the <story_id> with the data from the <photo> section,
> so as to enter it into the database and keep the appropriate
> relationships for the application that will be using this data.
>
> I will read more about XPath and how it can be helpful. I appreciate your
> suggestions.

Well, as a very basic concept I would probably query the xml document with
XPathNavigator using the xpath query /magazine/story, use the resultant
XPathNodeIterator to grab each story, and use subsequent queries to pull
the various pieces out.
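
A rough sketch of the XPathNavigator approach described above, in case it helps. The StoryReader class and ReadPhotoRows method are made-up names, and the code assumes a runtime where XPathNavigator.SelectSingleNode is available (.NET 2.0 or later). It pairs every photo filename with the story_id of its parent story, which is the association needed for the separate photo table.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class StoryReader
{
    // Returns one (story_id, photo_filename) pair per <photo> element, so each
    // photo row can carry the key of the story it belongs to.
    public static List<KeyValuePair<string, string>> ReadPhotoRows(TextReader xml)
    {
        var rows = new List<KeyValuePair<string, string>>();
        XPathNavigator nav = new XPathDocument(xml).CreateNavigator();

        // Outer query: every <story> under <magazine>.
        XPathNodeIterator stories = nav.Select("/magazine/story");
        while (stories.MoveNext())
        {
            XPathNavigator story = stories.Current;
            XPathNavigator idNode = story.SelectSingleNode("story_id");
            string storyId = idNode == null ? "" : idNode.Value;

            // Inner query, relative to the current story: its <photo> children.
            XPathNodeIterator photos = story.Select("photo");
            while (photos.MoveNext())
            {
                XPathNavigator file = photos.Current.SelectSingleNode("photo_filename");
                string fileName = file == null ? "" : file.Value;
                rows.Add(new KeyValuePair<string, string>(storyId, fileName));
            }
        }
        return rows;
    }
}
```

The pairs could then be fed to whatever insert statement the photo table needs; the database side is left out here because the table layout was not posted.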
"Daniel O'Connell" <onyxkirx@--NOSPAM--comcast.net> wrote in message
news:Od**************@TK2MSFTNGP10.phx.gbl...
If I may ask, what kind of problems are you having? Serialization is
probably not your only answer(it could have flexibility issues). My
immediate idea would be to use xpath. At that, is this xml format set in
stone?


Nov 15 '05 #3

Nick Malik

Assuming that the file is valid XML (even with the embedded HTML), you can
easily extract components of the structure using XPath queries, and even
iterate over the structure, pulling out each photo and each item.

The query for story_id is literally: /magazine/story/story_id
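
As an illustration only, that query can be run through XmlDocument.SelectNodes, and InnerXml gives back whatever markup sits inside <text>, provided it is well-formed. XPathDemo and its two methods are hypothetical names invented for this sketch.

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Collects the text of every <story_id> in the document.
    public static List<string> StoryIds(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var ids = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/magazine/story/story_id"))
        {
            ids.Add(node.InnerText);
        }
        return ids;
    }

    // Returns the raw content of the first story's <text>, tags included,
    // or null when there is no such element.
    public static string FirstStoryMarkup(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNode text = doc.SelectSingleNode("/magazine/story/text");
        return text == null ? null : text.InnerXml;
    }
}
```

LoadXml throws an XmlException on input that is not well-formed, so HTML that is not valid XML would fail at load time rather than at query time.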

--- Nick

Nov 15 '05 #4
