M said:
Hi,
I need to parse text files to extract data records. The files will
consist of a header, zero or more data records, and a trailer. I can
discard the header and trailer but I must split the data records up and
return them to an application.
The complexity here is that I won't know the exact format of the files
until run time.
Been there, done that, got the T-shirt in several different shapes and
sizes. We ended up writing a data language. (Well, I say we, but I had very
little to do with it actually.) I'm fairly sure I've described it here
before. A descriptor file (text, of course) was used to identify which
fields were present in which locations and how wide they were, that sort of
thing.
The files may or may not contain headers and trailers
and the format is not known yet.
You just said they would have a header and a trailer. The exact format may
be a moveable feast, but you need to establish a consistent meta-format
early on.
I would be grateful for some thoughts on how this could be achieved.
Let's say you wanted to write a C interpreter. (Analogy alert!) To process a
struct definition, you'd have to read it in from the text file, identify
the type of each member, and its name, and (if it's an array) its size. And
you'd have to have some way of finding or updating a particular member's
value, given its name.
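
To make the analogy a bit more concrete, here is a minimal sketch of the
sort of table such an interpreter might keep for one struct definition.
All the names (member_type, struct_def, find_member) are invented for
illustration; they are not taken from any real interpreter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical type tags; a real interpreter would need many more. */
enum member_type { MT_CHAR, MT_INT, MT_DOUBLE };

/* One member of a struct definition, as read from the source text. */
struct member {
    enum member_type type;
    const char *name;
    size_t count;            /* 1 for a scalar, N for an array of N */
};

/* A whole struct definition: a tag plus an ordered list of members. */
struct struct_def {
    const char *tag;
    const struct member *members;
    size_t nmembers;
};

/* Return the index of the named member, or -1 if there is none. */
long find_member(const struct struct_def *def, const char *name)
{
    size_t i;

    assert(def != NULL && name != NULL);

    for (i = 0; i < def->nmembers; i++) {
        if (strcmp(def->members[i].name, name) == 0)
            return (long)i;
    }
    return -1;
}
```

Access by index is just def->members[i], which is what lets you walk the
members in a loop; find_member covers the lookup-by-name case.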
You have much the same deal here. Your record is like a C struct, in a way.
(But not in another way. For reading and processing, you will almost
certainly want to be able to access the various fields of a record in a
loop - at least sometimes.) So that gives you a clue about your
configuration file structure. Say, for example, that you are dealing with
orders for nuts and bolts from fifteen different large customers, all of
whom send their orders to you electronically. You might want to have a
config file structure something like this:
FILETYPE Orders
CUSTOMER NutsNBoltsRUs

DEF RECORD Header
  CHAR Type
  DATE Created
  INTEGER RecordCount
ENDDEF

DEF RECORD Bolts
  CHAR Type
  DATE OrderDate
  CHAR 16 ProductCode
  STRING Description *
  INTEGER Height
  INTEGER TopDiameter
  CHAR 3 DontCareA
  INTEGER TipDiameter
  CHAR 3 DontCareB
  INTEGER PitchCode
  CHAR 6 DontCareC
  INTEGER PriceCode
ENDDEF

DEF RECORD Nuts
  CHAR Type
  DATE OrderDate
  CHAR 14 ProductCode
  STRING Description *
  INTEGER MatCode
  INTEGER Depth
  INTEGER ExternalDiameter
  INTEGER InternalDiameter
  INTEGER PitchCode
  INTEGER PriceCode
  CHAR 12 DontCareD
  INTEGER ColourCode
ENDDEF

As you can see, this is easily extensible, and its purpose is to describe
the file format supplied by a particular customer. Thus, its layout will
vary depending on that format. The above example contains some fields that
we simply aren't interested in, but we have to know enough about them to be
able to ignore them - hence the "DontCare" entries. At runtime, you simply
read the config file to find out where in a record each relevant field
lives. You'll end up with functions to read a record, work out what record
type it is, find a field within a given record either by name or by index,
and so on. Nothing terribly hard, but it needs careful planning.
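
As a rough illustration of the config-reading side, the sketch below
parses one field line (such as "CHAR 16 ProductCode") from a config file
like the one above into a descriptor. Everything in it is an assumption
made for the example: the struct layout, the function name
parse_field_line, the name-length and width limits, and in particular the
treatment of the trailing '*', which is simply recorded as a flag since
its meaning isn't spelled out here. It does not guess at the widths of
DATE or INTEGER fields; a line with no explicit width gets width 0.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FIELD_NAME_MAX 31      /* assumed limit on field name length */
#define FIELD_WIDTH_MAX 4096   /* assumed sanity cap on explicit widths */

enum field_type { FT_UNKNOWN, FT_CHAR, FT_DATE, FT_INTEGER, FT_STRING };

struct field {
    enum field_type type;
    int width;      /* explicit width from the line, or 0 if none given */
    int has_star;   /* nonzero if the line ended with '*' */
    char name[FIELD_NAME_MAX + 1];
};

static enum field_type type_from_keyword(const char *kw)
{
    if (strcmp(kw, "CHAR") == 0)    return FT_CHAR;
    if (strcmp(kw, "DATE") == 0)    return FT_DATE;
    if (strcmp(kw, "INTEGER") == 0) return FT_INTEGER;
    if (strcmp(kw, "STRING") == 0)  return FT_STRING;
    return FT_UNKNOWN;
}

/*
 * Parse one field line in any of these shapes:
 *     TYPE Name
 *     TYPE Width Name
 *     TYPE Name *
 * Leading whitespace is ignored. Returns 1 on success, 0 if the line is
 * not a field line (for example DEF, ENDDEF, or a malformed width).
 */
int parse_field_line(const char *line, struct field *out)
{
    char kw[16];
    char a[FIELD_NAME_MAX + 1];
    char b[FIELD_NAME_MAX + 1];
    char *end;
    long w;
    int n;

    assert(line != NULL && out != NULL);

    /* The %31s widths match FIELD_NAME_MAX, so the strcpy calls are safe. */
    n = sscanf(line, "%15s %31s %31s", kw, a, b);
    if (n < 2)
        return 0;

    out->type = type_from_keyword(kw);
    if (out->type == FT_UNKNOWN)
        return 0;

    out->width = 0;
    out->has_star = 0;

    if (n == 3 && strcmp(b, "*") == 0) {
        strcpy(out->name, a);              /* TYPE Name * */
        out->has_star = 1;
    } else if (n == 3) {
        w = strtol(a, &end, 10);           /* TYPE Width Name */
        if (*end != '\0' || w <= 0 || w > FIELD_WIDTH_MAX)
            return 0;
        out->width = (int)w;
        strcpy(out->name, b);
    } else {
        strcpy(out->name, a);              /* TYPE Name */
    }
    return 1;
}
```

Calling this for each line between DEF RECORD and ENDDEF, and appending
the results to an array, gives you the per-record field table that the
find-by-name and find-by-index functions would then search.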
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)