
humongous flat file

It has been suggested that rather than convert an already large flat
file, with many similar rows, to XML, some type of header be attached
to the file, containing some sort of meta-XML description of the rows
that follow. The hope is that the result will not grow as large as a
pure XML file, but still be easy to exchange. Multiple vendors would
still be able to track format changes easily. The size of the flat
file, without XML, is already an issue.

If it is not already apparent, I'm new to XML. Does anything like this
already exist? Thanks.

Dennis Farr
Treefrog Enterprises

-- "Can giraffes swim?" --
Jul 20 '05 #1
On 7 Aug 2003 07:39:57 -0700, df***@comcast.net (Dennis Farr) wrote:
> If it is not already apparent, I'm new to XML. Does anything like this
> already exist? Thanks.


It's a bad idea, don't do it. These ideas were popular in the last
century, when the "verbosity" of XML was seen as a problem.

It isn't. Get over it.
If you want to do XML, then do it. It's not rocket science.

Don't invent some whacko new pseudo-XML protocol to fix problems that
aren't there.

If you hate XML, then just say so. Enjoy your punch cards.

Jul 20 '05 #2

"Dennis Farr" <df***@comcast.net> wrote in message
news:c9**************************@posting.google.c om...
> It has been suggested that rather than convert an already large flat
> file, with many similar rows, to XML, some type of header be attached
> to the file, containing some sort of meta-XML description of the rows
> that follow. [...]
>
> If it is not already apparent, I'm new to XML. Does anything like this
> already exist? Thanks.


If your flat file contains fixed-length records and the data is textual,
you may already have overhead in the form of redundant trailing spaces.
Those spaces would not be carried over to the XML file, so you may see a
significant reduction in file size. There is also no need to be overly
verbose in your XML tag names; for instance, the <CustomersSurname> tag
can be reduced to <CS> as long as you keep uniqueness. Descriptive tag
names are irrelevant to storing the data; an end application can supply
the wordy descriptions.
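
For instance, a quick measurement of what tag length alone costs per
record (the field names and sample values here are invented for
illustration):

# Minimal sketch: measure how much tag length alone contributes to
# record size. The field names and sample values are hypothetical.
verbose = ("<Customer>"
           "<CustomersSurname>SMITH</CustomersSurname>"
           "<CustomersForename>JOHN</CustomersForename>"
           "</Customer>")
short = ("<C>"
         "<CS>SMITH</CS>"
         "<CF>JOHN</CF>"
         "</C>")

print(len(verbose.encode("utf-8")))  # bytes with descriptive tag names
print(len(short.encode("utf-8")))    # bytes with abbreviated tag names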

Denis
Jul 20 '05 #3
"Denis Saunders" <de************@norcross.com.au> wrote in message news:<bg*********@au-nws-0001.flow.com.au>...
> If your flat file contains fixed-length records and the data is
> textual, you may already have overhead in the form of redundant
> trailing spaces. [...] There is also no need to be overly verbose in
> your XML tag names; for instance, the <CustomersSurname> tag can be
> reduced to <CS> as long as you keep uniqueness.


Thanks. My data files are a mixture of rows from several database
tables. For the most part there is no white space, but there are tens
of (mostly short, fixed-length, encoded) columns per table, so even the
shortest tag names would at least double the size of the file.

It would be nice to give an XML-like skeleton for each type of database
row at the top of the file, then tag each record with the table it comes
from and use the appropriate skeleton to parse the text inside the tag.
There may be thousands to tens of thousands of rows of each type, so the
size savings would be considerable. If there is a way to do this and
stay within established standards, that would make my day.
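
To make the idea concrete, a hypothetical file of that shape, with a few
lines of Python to apply the skeletons, might look like the sketch below.
Every element name, table id, field width, and value here is invented;
nothing about it is a standard.

# Hypothetical format: an XML-ish header describes each row type once;
# the rows stay as compact fixed-width text tagged only with a table id.
lines = [
    '<skeleton table="cust" fields="id:6,surname:10,balance:8"/>',
    '<skeleton table="addr" fields="id:6,zip:5,street:20"/>',
    "cust|" + "000001" + "SMITH".ljust(10) + "00012500",
    "addr|" + "000001" + "27514" + "Maple Street".ljust(20),
]

layouts = {}  # table id -> [(field name, width), ...]
records = []
for line in lines:
    if line.startswith("<skeleton"):
        table = line.split('table="')[1].split('"')[0]
        spec = line.split('fields="')[1].split('"')[0]
        layouts[table] = [(name, int(width))
                          for name, width in
                          (f.split(":") for f in spec.split(","))]
    else:
        table, payload = line.split("|", 1)
        record, pos = {}, 0
        for name, width in layouts[table]:
            record[name] = payload[pos:pos + width].strip()
            pos += width
        records.append((table, record))

print(records[0])
# ('cust', {'id': '000001', 'surname': 'SMITH', 'balance': '00012500'})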

I know it is a bit stone-age to complain about storage space, but that
depends on the details of the applications, and quadrupling the size
of a really large file can still be expensive. Size also affects
transmission time, especially if encryption is involved. I'm not
knocking XML, I'm hoping to make XML more attractive to more people.
Jul 20 '05 #4
"Dennis Farr" <df***@comcast.net> wrote in message
news:c9**************************@posting.google.c om...
"Denis Saunders" <de************@norcross.com.au> wrote in message news:<bg*********@au-nws-0001.flow.com.au>...
If your flat file contains fixed length records and the data is textual then you may already have existing overheads with redundant trailing spaces.
These spaces would not be carried over to the XML file, hence you may have a large or some significant reduction in file size. There is no need to be
overly verbose in your XML tag names for instance <CustomersSurname> tag can be reduced to <CS> as long as you keep uniqueness. Descriptive tag names are irrelevant to storing the data. An end application can provide the wordy
descriptives.

Denis


Thanks. My data files are a mixture of rows from several database
tables and for the most part there is no white space but tens of
(mostly short and fixed length and encoded) columns per table, so the
shortest tag names would at least double the size of the file.


It seems like you really, really want to use CSV, but also get the seal
of approval as XML. The advantage of XML is that there are a lot of
parsers for reading it; if you kludge up the content, you lose that.
However, you can do

<everything>
<file1>
<row-csv>1,2,3333</row-csv>
</file1>
</everything>

Also, you can add the CSV headings. Highly unrecommended, though.
> I know it is a bit stone-age to complain about storage space, but that
> depends on the details of the applications, and quadrupling the size
> of a really large file can still be expensive. Size also affects
> transmission time, especially if encryption is involved. I'm not
> knocking XML, I'm hoping to make XML more attractive to more people.


Don't forget compression. All the repetitive tags are reduced to a few bits
each.
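
A quick sanity check with Python's standard zlib (the sample data is
invented) shows how much of the tag overhead compression claws back:

# Quick check of the "repetitive tags compress away" claim, using
# stdlib zlib on invented sample data.
import zlib

row = "<row-csv>1,2,3333</row-csv>\n"
xml = "<everything><file1>\n" + row * 10_000 + "</file1></everything>\n"

raw = xml.encode("utf-8")
packed = zlib.compress(raw, 9)

# The compressed output is a tiny fraction of the raw size, because the
# repeated tags cost almost nothing once deflate gets hold of them.
print(len(raw), len(packed))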

Jul 20 '05 #5
On 8 Aug 2003 09:15:09 -0700, df***@comcast.net (Dennis Farr) wrote:
> Size also affects
> transmission time, especially if encryption is involved.


No, it doesn't. If there are repeated strings in the file, they improve
compression efficiency. All significant transmissions are compressed
these days, so this verbosity just doesn't matter in practice. This
"XML is inefficient, so use cryptic 2-character element names" approach
is completely bogus.
Jul 20 '05 #6
Andy Dingley wrote:
> On 8 Aug 2003 09:15:09 -0700, df***@comcast.net (Dennis Farr) wrote:
>
> > Size also affects
> > transmission time, especially if encryption is involved.
>
> No, it doesn't. If there are repeated strings in the file, they
> improve compression efficiency. All significant transmissions are
> compressed these days, so this verbosity just doesn't matter in
> practice. This "XML is inefficient, so use cryptic 2-character element
> names" approach is completely bogus.

Have you tried testing that hypothesis? I have, and although I hate
cryptic 2-character element names just as much as you do, the fact is
that it actually does compress better. Here's a link to an IBM site
which illustrates this using test data:

http://www-106.ibm.com/developerwork...matters13.html

Note, however, that there are probably better ways to address this than
the method mentioned in the article. One possibility might be

http://www.w3.org/TR/wbxml/

It's worth noting that this is NOT a W3C Recommendation. It's also
worth noting that I haven't actually ever tried WBXML, so you can
consider this my own untested hypothesis and treat it accordingly! :-)

I would be interested to hear from those who have successfully used
alternative encodings for XML, especially ones for which the size
reduction was a primary motivation.

Ed

Jul 20 '05 #7
On Fri, 08 Aug 2003 22:40:57 -0400, Ed Beroset
<be*****@mindspring.com> wrote:
> Have you tried testing that hypothesis?


Yes, about 4 years ago - it's last century's problem.

Even then, I was juggling XML and rich-media. XML is primarily a
format for text content, so it's just _tiny_ in comparison to any
image or video data. There's just no point in worrying over element
name lengths, when there are JPEGs on the same server.

Mainly I work in RDF. Fairly long names, lots of repetition of
properties like "type", and honking great URIs all over the place.
Switching <foo> to <fo> isn't going to make a blind bit of difference.

Encoding schemes for embedding binary data into XML content, now that's
an issue worth saving bytes over.
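
Binary payloads typically go into XML as base64, which adds a third
again before any tags appear; a minimal sketch (the payload size is
invented):

# Base64 grows binary payloads by 4/3 before any XML wrapping is added.
import base64
import os

blob = os.urandom(30_000)          # stand-in for an image or similar
encoded = base64.b64encode(blob)   # what would go inside the element

print(len(blob), len(encoded))     # 30000 -> 40000 bytes (+33%)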
Jul 20 '05 #8
Andy Dingley wrote:
> On Fri, 08 Aug 2003 22:40:57 -0400, Ed Beroset
> <be*****@mindspring.com> wrote:
>
> > Have you tried testing that hypothesis?
>
> Yes, about 4 years ago - it's last century's problem.
>
> Even then, I was juggling XML and rich-media. XML is primarily a
> format for text content, so it's just _tiny_ in comparison to any
> image or video data.


I don't think that's the kind of data the OP had in mind. In the
context of video data, it might indeed be tiny by comparison, but I
suspect that most of us work with "last century's data" and so we still
think about things like bandwidth, efficiency, and other anachronistic
concepts of engineering.
> Mainly I work in RDF. Fairly long names, lots of repetition of
> properties like "type", and honking great URIs all over the place.
> Switching <foo> to <fo> isn't going to make a blind bit of difference.


In that context, maybe not, but let's try an experiment with real data
of the non-RDF variety.

The experiment:

I chose the Wake County, North Carolina voter database as the source for
my sample data. It's freely downloadable from the web, contains very
typical name and address data, and is large enough (with 415613
records) to be able to draw some useful conclusions. I extracted the
first five fields of each record of that plain-text database, which the
state government labels voter_reg_number, last_name, first_name,
midl_name, and name_sufx. I think those are sufficiently expressive
names that we'd all be able to guess their meanings without a second
thought, so I used them as tag names, too. Wrapping each record up in
<voter></voter> delimiters and the whole thing in <voters></voters>
tags, and minimal other stuff, my test file turns out to be 60685379
bytes long using an 8-bit encoding and Unix-style line endings (one per
record).

Compression:

First, I tried various techniques to reduce the size of the XML file.

The original file is voters1.xml and each voter record has these fields:
voter_reg_number, last_name, first_name, midl_name, and name_sufx

The second file is voters2.xml and each voter record has these fields:
reg_number, last_name, first_name, midl_name, and name_sufx
(The change is that voter_reg_number became just reg_number.)

The third file is voters3.xml and each voter record has these fields:
reg_number, name
Within name there are four fields: last, first, midl, and sufx
(The change is that name now has subfields.)

The fourth file is voters4.xml and each voter record has these fields:
reg_number, foo
Within foo there are four fields: last, first, midl, and sufx
(The change is that name is changed to foo.)

The fifth file is voters5.xml and each voter record has these fields:
reg_number, fo
Within fo there are four fields: last, first, midl, and sufx
(The change is that foo is changed to fo.)

Here are the sizes and names of the files generated:

60685379 voters1.xml
55697543 voters2.xml
44474912 voters3.xml
43643606 voters4.xml
42812300 voters5.xml

18250 voters1.xml.bz2
17519 voters2.xml.bz2
14251 voters3.xml.bz2
13921 voters4.xml.bz2
12520 voters5.xml.bz2

I'll leave it to you to analyze all the details, since I've provided all
the data to do that, but I thought I'd point out a couple of salient
points. Just a judicious use of shorter tags gives a compressed file
that's 22% smaller (voters3.xml.bz2 compared to voters1.xml.bz2) and no
less comprehensible by humans. Also, note that contrary to your guess,
a change of a single tag from <foo> to <fo> yields a 10% decrease in
the size of the compressed files (voters5.xml.bz2 compared to
voters4.xml.bz2), even though the uncompressed versions of those files
decreased in size by less than 2%.

Conclusions:
1. Using shorter tags may indeed save transmission time.
2. Restructuring "flat" data may give better results without sacrificing
clarity to human readers.
3. Sometimes results are counterintuitive and data-dependent. Measuring
effects on your actual data and comparing those to the engineering
problem to be solved is the only sure way to proceed.

I hope that helps clarify things. If anyone would like to duplicate
this experiment, you can find the raw data at
http://msweb03.co.wake.nc.us/bordele...vesOptions.asp
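
A minimal harness for rerunning the size comparison might look like the
sketch below; the input file name and its column order are assumptions
about the downloaded extract, so adjust them to the real data.

# Wrap each record in the given tags, then compare raw and
# bz2-compressed sizes for several tag-name choices.
import bz2
import csv

FIELDS = ["voter_reg_number", "last_name", "first_name",
          "midl_name", "name_sufx"]

def build_xml(rows, tags):
    parts = ["<voters>"]
    for row in rows:
        parts.append("<voter>")
        parts.extend(f"<{t}>{v}</{t}>" for t, v in zip(tags, row))
        parts.append("</voter>")
    parts.append("</voters>")
    return "\n".join(parts).encode("utf-8")

with open("voters.csv", newline="") as f:          # placeholder path
    rows = [r[:5] for r in csv.reader(f)]

for tags in (FIELDS,                                # voters1-style
             ["reg_number"] + FIELDS[1:],           # voters2-style
             ["rn", "ln", "fn", "mn", "sx"]):       # fully abbreviated
    xml = build_xml(rows, tags)
    print(tags[0], len(xml), len(bz2.compress(xml)))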

Ed

Jul 20 '05 #9
On 11 Aug 2003 06:55:01 -0700, df***@comcast.net (Dennis Farr) wrote:
> When the data is as voluminous as, for example, an individual's
> genetic makeup on the back of a health card, what if the space taken
> up by the XML tags is much larger

What indeed. Moore's Law. Throw some hardware at it.

The problem is not about storing this stuff. My mobile phone gives a
gazillion bytes over to just storing ring tones. I don't even know how
big the HD in my laptop is, it's just "big". Storage is not today's
big problem.

Now go to a library and work with MARC records for a while (or SS7, or
almost anything where ASN.1 has played a part). Then find some old
records from such a system and try to make sense of them. Chances are
you can't. This is a serious problem. Find a digital dataset that's
over 10 years old and try to read it. The failure rate is terrifying
(read up on the BBC Domesday Project).

I don't give a damn about storage size - not my problem, I've got
computers to do that for me. What I care about is future human
understandability, or if I'm really lucky, machine understandability.
> Is that the next logical step of evolution after XML?
> Bioinformatics is just one example of really huge data files.

Go take a look at Stanford's Protege project.

Or RDF, or DAML, or OWL

> http://msdn.microsoft.com/library/de...zdesk_riqj.asp
> seems to be on the right track. But I would prefer open source.


Right track? It's not even leaving the station.

This is a common approach to the problem, and it's more bogus than a
Cayman Islands $3 bill. Taking the dataset (with the implicit
assumption that all XML data is extracted from an RDBMS) and then
labelling it as "row/column" adds nothing to the semantics of the
representation and merely perpetuates the database structure you've
just pulled it from. It's no better than CSV!

XML has a restrictive data model. It's a single-rooted tree, when the
real world is more like a directed graph. But even so, it's a lot more
expressive than this narrow "everything is a rectangular grid"
approach.

Jul 20 '05 #10
"Ed Beroset", Andy Dingley and Dennis Farr wrote:
>> Size also affects
>> transmission time, especially if encryption is involved.
>
> No, it doesn't. If there are repeated strings in the file, they
> improve compression efficiency. All significant transmissions are
> compressed these days, so this verbosity just doesn't matter in
> practice. This "XML is inefficient, so use cryptic 2-character element
> names" approach is completely bogus.


This depends on the sequence: encrypt-then-compress does poorly,
because the repetitive tags are transformed into dissimilar strings
that don't compress. Compress-then-encrypt is as good as plain
compression. Q: Which order is actually used? What does HTTPS do? What
if the source files are encrypted already?
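
The ordering effect is easy to demonstrate with a sketch: stdlib zlib
plus a toy SHA-256 keystream standing in for a real cipher (the sample
data is invented; don't use the toy cipher for anything real):

# Compress-then-encrypt vs encrypt-then-compress, demonstrated with
# stdlib zlib and a toy counter-mode keystream as a stand-in cipher.
import hashlib
import zlib

def keystream_xor(data: bytes, key: bytes) -> bytes:
    out = bytearray()
    for block in range((len(data) + 31) // 32):
        pad = hashlib.sha256(key + block.to_bytes(8, "big")).digest()
        chunk = data[block * 32:(block + 1) * 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)

xml = "<voter><last>SMITH</last><first>JOHN</first></voter>\n" * 5_000
raw = xml.encode("utf-8")
key = b"not-a-real-key"

good = keystream_xor(zlib.compress(raw, 9), key)  # compress, then encrypt
bad = zlib.compress(keystream_xor(raw, key), 9)   # encrypt, then compress

print(len(raw), len(good), len(bad))
# Encrypted-first output looks random, so it barely compresses at all;
# compressed-first stays as small as plain compression.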

> Have you tried testing that hypothesis? I have, and although I hate
> cryptic 2-character element names just as much as you do, the fact is
> that it actually does compress better. Here's a link to an IBM site
> which illustrates this using test data:
>
> http://www-106.ibm.com/developerwork...matters13.html


Very interesting analysis. To get the max compression, it looks like
we need to compress before sending, rather than relying on the comm
link to choose compression for us.

--
Steve
Jul 20 '05 #11
