Parsing file to extraction records

M
Hi,

I need to parse text files to extract data records. The files will
consist of a header, zero or more data records, and a trailer. I can
discard the header and trailer but I must split the data records up
and return them to an application.

The complexity here is that I won't know the exact format of the files
until run time. The files may or may not contain headers and trailers
and the format is not known yet. The records may have clearly defined
start and end markers but they may not. There may be a fixed separator
between the records or there may not. (Separators will be used if
there are no record start and end markers).

The current idea is to use UNIX regular expressions to define the
format of the parts of the file and match them up at run time. However
it is not clear whether it would be possible to develop single
expressions for the whole file or whether I would have to use separate
regular expressions for each part of the file (header, trailer,
separator, begin/end record etc.). If a single expression is used I
would imagine the expression would match all the data records rather
than being able to recognise individual records.
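
On the worry that a single expression would swallow all the data records at once: a record-level expression can be applied repeatedly, resuming after each match, so that each call recognises one record. The following is only a minimal sketch, assuming the POSIX <regex.h> interface (regcomp/regexec) is available on the target platforms. The function name for_each_record, the callback type record_fn, and the convention that sub-expression 1 holds the record body are illustrative assumptions, not part of any existing application.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical callback: receives one record body (not NUL-terminated). */
typedef void (*record_fn)(const char *rec, size_t len, void *ctx);

/*
 * Apply one POSIX extended regex repeatedly to a NUL-terminated buffer.
 * Sub-expression 1 of the pattern is assumed to be the record body.
 * Returns the number of records matched, or -1 if the pattern does not
 * compile or contains no sub-expression.
 */
int for_each_record(const char *text, const char *pattern,
                    record_fn fn, void *ctx)
{
    regex_t re;
    regmatch_t m[2];
    int count = 0;
    int eflags = 0;

    assert(text != NULL && pattern != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (re.re_nsub < 1) {
        regfree(&re);
        return -1;
    }
    while (regexec(&re, text, 2, m, eflags) == 0) {
        /* Hand the body to the caller if group 1 took part in the match. */
        if (fn != NULL && m[1].rm_so >= 0)
            fn(text + m[1].rm_so, (size_t)(m[1].rm_eo - m[1].rm_so), ctx);
        count++;
        if (m[0].rm_eo == 0)      /* empty match at the start: stop */
            break;
        text += m[0].rm_eo;       /* resume just after this record */
        eflags = REG_NOTBOL;      /* '^' must not match at resume points */
    }
    regfree(&re);
    return count;
}
```

With a made-up format in which each record is wrapped in <REC> and </REC>, the pattern "<REC>([^<]*)</REC>" would visit the records one at a time. POSIX matching is leftmost-longest, so the body is written as [^<]* rather than .* to stop one match from spanning several records.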

This code is to extend an application already written in C running on
UNIX (and OpenVMS) platforms.

I would be grateful for some thoughts on how this could be achieved.

Mar 9 '06 #1
7 Replies



M wrote:
> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up
> and return them to an application.

I believe this question is better suited for comp.programming or
similar...

> The complexity here is that I won't know the exact format of the files
> until run time. The files may or may not contain headers and trailers
> and the format is not known yet. The records may have clearly defined
> start and end markers but they may not. There may be a fixed separator
> between the records or there may not. (Separators will be used if
> there are no record start and end markers).

I don't really understand how you're going to cater for this level of
indeterminacy.

> The current idea is to use UNIX regular expressions to define the
> format of the parts of the file and match them up at run time. However
> it is not clear whether it would be possible to develop single
> expressions for the whole file or whether I would have to use separate
> regular expressions for each part of the file (header, trailer,
> separator, begin/end record etc.). If a single expression is used I
> would imagine the expression would match all the data records rather
> than being able to recognise individual records.


If you at least know the limits of what can be expected, why don't you
come up with a simple(ish) file description language, and pre-pend it
(or use it as a header).

Still, nothing C-specific here. Try some other groups.

--
BR, Vladimir

Mar 9 '06 #2

M
Thanks for your response.

> I believe this question is better suited for comp.programming or
> similar...

It is posted to comp.programming (and crossposted to comp.lang.c).

> If you at least know the limits of what can be expected, why don't you
> come up with a simple(ish) file description language, and pre-pend it
> (or use it as a header).

This seems even more difficult than the ideas I discussed. Maybe I did
not explain the requirements well. The program has to cope with a variety
of different file formats. Hence the need to make the program flexible.
The file format would be specified in a database or configuration file and
would be fixed for any particular instance of the program. However there will
be many such programs running on different installations all reading different
file formats.

> Still, nothing C-specific here. Try some other groups.


It's got to be written in C. I think that is specific :-)

M

Mar 9 '06 #3

NB: Posted just to comp.lang.c

M wrote:
> Thanks for your response.
>
>> I believe this question is better suited for comp.programming or
>> similar...
>
> It is posted to comp.programming (and crossposted to comp.lang.c)

Sorry, I did not see this.

>> If you at least know the limits of what can be expected, why don't you
>> come up with a simple(ish) file description language, and pre-pend it
>> (or use it as a header).
>
> This seems even more difficult than the ideas I discussed. Maybe I did
> not explain the requirements well. The program has to cope with a variety
> of different file formats. Hence the need to make the program flexible.
> The file format would be specified in a database or configuration file and
> would be fixed for any particular instance of the program. However there will
> be many such programs running on different installations all reading different
> file formats.

You suggested regular expressions. I suggested a simplified form (in
different words), specific to your implementation. Where the
description is stored is really immaterial.

>> Still, nothing C-specific here. Try some other groups.
>
> It's got to be written in C. I think that is specific :-)


You're really after the method, which can be implemented in any
language.

This group (c.l.c) discusses the C language only. Once you implement
this in C (or start implementing it), and have a question about
/implementation/ using standard C, this is the place to ask about it.
(Although, as you will have noticed, we do tend to give it a stab,
while pointing to the better place to ask. ;-) )

--
BR, Vladimir

Mar 9 '06 #4

M said:
> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up
> and return them to an application.
>
> The complexity here is that I won't know the exact format of the files
> until run time.

Been there, done that, got the tee-shirt in several different shapes and
sizes. We ended up writing a data language. (Well, I say we, but I had very
little to do with it actually.) I'm fairly sure I've described it here
before. A descriptor file (text, of course) was used to identify which
fields were present in which locations and how wide they were, that sort of
thing.

> The files may or may not contain headers and trailers
> and the format is not known yet.

You just said they would have a header and a trailer. The exact format may
be a moveable feast, but you need to establish a consistent meta-format
early on.

> I would be grateful for some thoughts on how this could be achieved.


Let's say you wanted to write a C interpreter. (Analogy alert!) To process a
struct definition, you'd have to read it in from the text file, identify
the type of each member, and its name, and (if it's an array) its size. And
you'd have to have some way of finding or updating a particular member's
value, given its name.

You have much the same deal here. Your record is like a C struct, in a way.
(But not in another way. For reading and processing, you will almost
certainly want to be able to access the various fields of a record in a
loop - at least sometimes.) So that gives you a clue about your
configuration file structure. Say, for example, that you are dealing with
orders for nuts and bolts from fifteen different large customers, all of
whom send their orders to you electronically. You might want to have a
config file structure something like this:

FILETYPE Orders
CUSTOMER NutsNBoltsRUs
DEF RECORD Header
    CHAR Type
    DATE Created
    INTEGER RecordCount
ENDDEF
DEF RECORD Bolts
    CHAR Type
    DATE OrderDate
    CHAR 16 ProductCode
    STRING Description *
    INTEGER Height
    INTEGER TopDiameter
    CHAR 3 DontCareA
    INTEGER TipDiameter
    CHAR 3 DontCareB
    INTEGER PitchCode
    CHAR 6 DontCareC
    INTEGER PriceCode
ENDDEF
DEF RECORD Nuts
    CHAR Type
    DATE OrderDate
    CHAR 14 ProductCode
    STRING Description *
    INTEGER MatCode
    INTEGER Depth
    INTEGER ExternalDiameter
    INTEGER InternalDiameter
    INTEGER PitchCode
    INTEGER PriceCode
    CHAR 12 DontCareD
    INTEGER ColourCode
ENDDEF

As you can see, this is easily extensible, and its purpose is to describe
the file format supplied by a particular customer. Thus, its layout will
vary depending on that format. The above example contains some fields that
we simply aren't interested in, but we have to know enough about them to be
able to ignore them - hence the "DontCare" entries. And at runtime, you
simply read the config file to find out where in a record the relevant
field information was. You'll end up with functions to read a record, work
out what record type it is, find a field within a given record either by
name or by index, etc etc. Nothing terribly hard, but needs careful
planning.
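
To make the "find a field by name or by index" part concrete, here is a rough sketch of how the field lines of such a descriptor might be loaded into a table. It is not the original system: the struct layouts, the function names (parse_field_line, add_field, find_field) and the size limits are all invented for illustration. It also assumes that a trailing "*" marks a variable-length field, which the description above does not actually state. Only field lines are handled; FILETYPE, CUSTOMER, DEF RECORD and ENDDEF are left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 32
#define NAME_LEN   32

/* One field line from the descriptor, e.g. "CHAR 16 ProductCode". */
struct field_def {
    char type[NAME_LEN];   /* CHAR, DATE, INTEGER, STRING, ... */
    char name[NAME_LEN];
    int  width;            /* 0 when the line gives no width */
    int  variable;         /* 1 when the line ends in "*" (assumed meaning) */
};

struct record_def {
    int nfields;
    struct field_def fields[MAX_FIELDS];
};

/* Parse "TYPE [WIDTH] NAME [*]". Returns 0 on success, -1 otherwise. */
int parse_field_line(const char *line, struct field_def *f)
{
    char tok[4][NAME_LEN];
    int n, i = 1;

    assert(line != NULL && f != NULL);
    memset(f, 0, sizeof *f);
    n = sscanf(line, "%31s %31s %31s %31s", tok[0], tok[1], tok[2], tok[3]);
    if (n < 2)
        return -1;               /* need at least a type and a name */
    strcpy(f->type, tok[0]);
    /* An all-digit second token is taken to be the field width. */
    if (strspn(tok[1], "0123456789") == strlen(tok[1])) {
        f->width = atoi(tok[1]);
        i = 2;
    }
    if (i >= n)
        return -1;               /* a width but no name */
    strcpy(f->name, tok[i++]);
    if (i < n && strcmp(tok[i], "*") == 0)
        f->variable = 1;
    return 0;
}

/* Append one parsed field line to a record definition. */
int add_field(struct record_def *r, const char *line)
{
    assert(r != NULL);
    if (r->nfields >= MAX_FIELDS)
        return -1;
    if (parse_field_line(line, &r->fields[r->nfields]) != 0)
        return -1;
    r->nfields++;
    return 0;
}

/* Return the index of the named field, or -1 if it is not defined. */
int find_field(const struct record_def *r, const char *name)
{
    int i;

    assert(r != NULL && name != NULL);
    for (i = 0; i < r->nfields; i++)
        if (strcmp(r->fields[i].name, name) == 0)
            return i;
    return -1;
}
```

Because the fields sit in an array, looping over a record by index works as well as lookup by name, and the "DontCare" entries need no special handling.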
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)
Mar 9 '06 #5

It is impossible to use regex w/o knowing the file formats.

If you can provide further information on what you want to do with your
program, I will try to provide some further assistance.

Mar 10 '06 #6


"Programming Master" <ei*********@gmail.com> wrote in message
news:11**********************@e56g2000cwe.googlegroups.com...
> It is impossible to use regex w/o knowing the file formats.
>
> If you can provide further information on what you want to do with your
> program, I will try to provide some further assistance.


I think the OP is saying the program WILL know the file formats...
except only at runtime, instead of at compile time.

Mar 10 '06 #7

M
> I think the OP is saying the program WILL know the file formats...
> except only at runtime, instead of at compile time.


Correct. The program will have to cope with many different file
formats (conforming to the specification from my original post). The
exact format will be known at run time and may be specified in terms of
regular expressions.

The purpose of this application is to interpret data files from many
different clients. Each client uses a slightly different file format.
My program has to be able to read all the files.

I have now completed a prototype, based on the provision of five
different regular expressions to define a file format. It would be
nice to reduce the number of expressions necessary - but I can't see
a way of doing this. This is really what the original post was
about - using a single RE.
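
For readers wondering what a five-expression format description could look like, here is a hedged sketch. The post does not list the five expressions, so the member names below are borrowed from the parts named in the original question (header, trailer, separator, record begin, record end) and are an assumption. struct file_format, format_is_usable and count_separated_records are illustrative names, and POSIX <regex.h> is assumed. The counting function covers only the separator case, which the original question says applies when there are no begin/end markers.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical per-client format description; NULL means "not used". */
struct file_format {
    const char *header;      /* matched once at the start, then discarded */
    const char *trailer;     /* matched once at the end, then discarded */
    const char *separator;   /* used when there are no begin/end markers */
    const char *rec_begin;
    const char *rec_end;
};

/* A format needs either both record markers or a separator. */
int format_is_usable(const struct file_format *f)
{
    assert(f != NULL);
    if (f->rec_begin != NULL && f->rec_end != NULL)
        return 1;
    return f->separator != NULL;
}

/*
 * Count the records in `text` when they are delimited only by a
 * separator expression. Empty pieces (for example after a trailing
 * separator) are not counted. Returns -1 if the pattern fails to compile.
 */
int count_separated_records(const char *text, const char *sep_pattern)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    int eflags = 0;

    assert(text != NULL && sep_pattern != NULL);
    if (regcomp(&re, sep_pattern, REG_EXTENDED) != 0)
        return -1;
    while (*text != '\0' && regexec(&re, text, 1, &m, eflags) == 0) {
        if (m.rm_eo == m.rm_so)   /* separator matched nothing: give up */
            break;
        if (m.rm_so > 0)
            count++;              /* non-empty piece before the separator */
        text += m.rm_eo;          /* continue after the separator */
        eflags = REG_NOTBOL;      /* '^' must not match at resume points */
    }
    if (*text != '\0')
        count++;                  /* last piece, with no separator after it */
    regfree(&re);
    return count;
}
```

One possible way to cut down the number of expressions is to fold the begin marker, body and end marker into a single record expression with a sub-expression for the body, leaving header, trailer and separator as optional extras. Whether that suits every client format is not something the thread establishes.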

Mark

Mar 13 '06 #8
