Bytes IT Community

using large XML for interfaces

Hi Y'all ,

A client of our company asked me to look for alternatives to their
current interfacing method.
They use large files in which the data are stored in records like
below:
0000000XX21123456789DoeJohn11091901MWashington etc
(record ID | ID | name | birth date | sex | place of birth)
These files are huge (over 100 MB).
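For illustration, a record in that layout can be unpacked by character offset. The field widths below are guesses from the sample record, not the client's real specification:

```python
# Hypothetical offsets inferred from the sample record; the real field
# widths would come from the client's record specification.
RECORD = "0000000XX21123456789DoeJohn11091901MWashington"

LAYOUT = [
    ("record_id",   0, 11),  # "0000000XX21"
    ("person_id",  11, 20),  # "123456789"
    ("name",       20, 27),  # "DoeJohn"
    ("birth",      27, 35),  # "11091901"
    ("sex",        35, 36),  # "M"
    ("birthplace", 36, 46),  # "Washington"
]

def parse(line):
    """Slice one fixed-width record into a dict of stripped fields."""
    return {name: line[a:b].strip() for name, a, b in LAYOUT}

print(parse(RECORD))
```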
They asked our company to look for an alternative method of interfacing
these data. They were thinking of using XML.
What I've read until now is that, for large data sets, it's not wise
to use XML:
- The files become even bigger
- Processing time goes up, with massive memory usage

What do you guys think of these arguments? Are there any other
alternatives ?

greetz Aschwin

Oct 25 '06 #1
4 Replies



av******@gmail.com wrote:
> These files are huge (over 100 MB).
> They asked our company to look for an alternative method of interfacing
> these data. They were thinking of using XML.
XML is an easy, almost trivial, drop-in replacement for CSV file
interchange like this. There are only a couple of issues to be aware
of:

* Use an event-driven parser like SAX, not a monolithic "parse it all
then use it" DOM

* XML requires a "complete" document, so it's hard to read documents
that are still being appended to. (There must be one root element,
whose start tag appears at the beginning of the document and whose end
tag at the end.)
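A minimal sketch of the event-driven approach, using Python's stdlib SAX parser. The element names (`<records>`, `<record>`) are assumptions about the target schema, not anything the OP specified:

```python
import xml.sax

class RecordHandler(xml.sax.ContentHandler):
    """Count (or otherwise process) <record> elements as they stream past."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def endElement(self, tag):
        if tag == "record":
            # The record is complete at this point; process it, then forget it.
            self.count += 1

def count_records(source):
    handler = RecordHandler()
    xml.sax.parse(source, handler)  # constant memory, however big the file
    return handler.count
```

The handler never holds more than one record's worth of state, which is what makes this workable on a 100 MB file where a full DOM would not be.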

Non-issues are:

* Verbosity. Practical XML documents are frequently smaller than
equivalent fixed-field documents because they handle sparse data much
better. They can even be smaller than some CSV formats.

* Verbosity. Yes, XML adds repeated tag names to the document. In
practice this just isn't a problem (for one thing they compress very
well in transmission). It's certainly no reason to resort to
unreadable <NAM>, <ADR> element names!

* Speed. XML parsers are efficiently coded against formal syntax
definitions. They almost always beat custom-written informal parsers
written in application scripting languages.

Particular benefits are:

* Reliability. XML _works_. It works reliably for any input data too,
because it's a well thought-through protocol. Wave goodbye to all those
awkward names that broke the comment or apostrophe escaping algorithm
you had coded by a junior intern. "O'Reilly" won't break it, nor will
<an arabic name I can't even paste into Usenet>

* Internationalization. Oh yes. It just does it. For any encoding.
With no effort on your part. Rejoice!

* Interoperability. Your XML is my XML. Guaranteed. No more CSV
encoding hangups between systems.
There _are_ good ways to break XML.

In particular, XML isn't a database. A 100MB document is certainly
workable as a transfer document, but it's not usually a good idea to
load it into a DOM and then try repeated random lookups into it.
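One common middle ground, sketched here with Python's stdlib `ElementTree.iterparse` (again assuming hypothetical `<record>`/`<name>` elements): parse incrementally, but clear each element once it has been handled, so a huge transfer document never sits in memory all at once:

```python
import xml.etree.ElementTree as ET

def stream_names(source):
    """Yield one value per <record> without building the whole tree."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "record":
            yield elem.findtext("name")
            elem.clear()  # drop the finished subtree so memory stays flat
```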

XML isn't a messaging protocol either. If you have lots of tiny
messages flying around, then wrap up your XML in something else (maybe
SOAP) and use that. You can't do some of the tricks you used to do with
a CSV file, such as treating them like a pipe and reading from one end
whilst still writing to the other.

XML has rules, so stick to them. Vanilla ASCII isn't too much trouble,
but if you're going to throw lots of "<" , や or ř around, then
learn what the options for encoding them are and use them correctly.
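In Python, for instance, the stdlib already knows those rules, so there is no need to hand-roll an escaping routine:

```python
from xml.sax.saxutils import escape, quoteattr

# escape() handles the characters that must never appear raw in element
# content; quoteattr() additionally quotes a value for use as an attribute.
print(escape("Doe & Sons <est. 1901>"))  # Doe &amp; Sons &lt;est. 1901&gt;
print(quoteattr("O'Reilly"))             # attribute-safe quoted string
```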

Oct 25 '06 #2

av******@gmail.com wrote:
> A client of our company asked me to look for alternatives to their
> current interfacing method.
Why are they looking for alternatives ?
Why change it if it works ?
> They use large files in which the data are stored in records like
> below:
> 0000000XX21123456789DoeJohn11091901MWashington etc
> (record ID | ID | name | birth date | sex | place of birth)
> These files are huge (over 100 MB).
> They asked our company to look for an alternative method of interfacing
> these data. They were thinking of using XML.
If they use XML, they are "buzz-word compliant".
Are there technical or political reasons for XML ?
If there are political reasons, then stop arguing
in technical terms.
> What I've read until now is that, for large data sets, it's not wise
> to use XML:
> - The files become even bigger
> - Processing time goes up, with massive memory usage
Andy Dingley has summarized the advantages of XML quite well.
I disagree with him when it comes to file size: In your case
(fixed format data) the XML variant _will_ be bigger.
> What do you guys think of these arguments? Are there any other
> alternatives ?
Andy has already pointed out that XML data may contain
any German Umlaut, Cyrillic or Japanese special character
that you will ever find. This _is_ an advantage.
Oct 25 '06 #3

Jürgen Kahrs wrote:
> Why are they looking for alternatives ?
> Why change it if it works ?
My current project involves pumping vast comma-delimited and
fixed-field files around between vendors, with what looks like similar
data to the OP.
Believe me, there are _plenty_ of reasons to move to XML, not just
fashion.
> If they use XML, they are "buzz-word compliant".
XML hasn't been a hot buzzword for years now, it's just plumbing.
> If there are political reasons, then stop arguing
> in technical terms.
Always wise advice!
> I disagree with him when it comes to file size: In your case
> (fixed format data) the XML variant _will_ be bigger.
I've just seen a factor-of-4 shrinkage going from a fixed-field file
with address data in it to XML. Most of the original file was simply
empty space for spare address lines, but we were faithfully shipping
it around as a couple of MB of whitespace.
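A sketch of why that happens, with a hypothetical address record whose fixed-width layout is mostly blank padding: empty fields simply produce no element at all.

```python
import xml.etree.ElementTree as ET

# Hypothetical address record: most of the fixed-width fields are blank
# padding for spare address lines.
fields = {"line1": "1 Main St", "line2": "", "line3": "", "line4": "",
          "city": "Washington", "postcode": ""}

rec = ET.Element("address")
for name, value in fields.items():
    if value.strip():  # emit only the non-empty fields
        ET.SubElement(rec, name).text = value

print(ET.tostring(rec, encoding="unicode"))
# <address><line1>1 Main St</line1><city>Washington</city></address>
```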

Oct 25 '06 #4

Andy Dingley wrote:
> My current project involves pumping vast comma-delimited and
> fixed-field files around between vendors, with what looks like similar
> data to the OP.
But he had fixed-width data; it looked like there were
no sparse lines.
> XML hasn't been a hot buzzword for years now, it's just plumbing.
The problem for many conservative Unix users is that
the plumbing can't be done with their usual toolset.
XML requires a new toolset and deprecates the old one.
> I've just seen a factor-of-4 shrinkage going from a fixed-field file
> with address data in it to XML. Most of the original file was simply
> empty space for spare address lines, but we were faithfully shipping
> it around as a couple of MB of whitespace.
If there really are blank fields in the data, then it
sounds plausible that the amount of data shrinks.
But the OP's data didn't look sparse.
Oct 26 '06 #5
