Parsing files to extract records

Hi,

I need to parse text files to extract data records. The files will
consist of a header, zero or more data records, and a trailer. I can
discard the header and trailer but I must split the data records up
and return them to an application.

The complexity here is that I won't know the exact format of the files
until run time. The files may or may not contain headers and trailers
and the format is not known yet. The records may have clearly defined
start and end markers but they may not. There may be a fixed separator
between the records or there may not. (Separators will be used if
there are no record start and end markers).

The current idea is to use UNIX regular expressions to define the
format of the parts of the file and match them up at run time. However
it is not clear whether it would be possible to develop single
expressions for the whole file or whether I would have to use separate
regular expressions for each part of the file (header, trailer,
separator, begin/end record etc.). If a single expression is used I
would imagine the expression would match all the data records rather
than being able to recognise individual records.

This code is to extend an application already written in C running on
UNIX (and OpenVMS) platforms.

I would be grateful for some thoughts on how this could be achieved.

Mar 9 '06 #1

Vladimir S. Oka

M wrote:

> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up
> and return them to an application.

I believe this question is better suited for comp.programming or
similar...

> The complexity here is that I won't know the exact format of the files
> until run time. The files may or may not contain headers and trailers
> and the format is not known yet. The records may have clearly defined
> start and end markers but they may not. There may be a fixed separator
> between the records or there may not. (Separators will be used if
> there are no record start and end markers).

I don't really understand how you're going to cater for this level of
indeterminacy.

> The current idea is to use UNIX regular expressions to define the
> format of the parts of the file and match them up at run time. However
> it is not clear whether it would be possible to develop single
> expressions for the whole file or whether I would have to use separate
> regular expressions for each part of the file (header, trailer,
> separator, begin/end record etc.). If a single expression is used I
> would imagine the expression would match all the data records rather
> than being able to recognise individual records.

If you at least know the limits of what can be expected, why don't you
come up with a simple(ish) file description language, and pre-pend it
(or use it as a header).

Still, nothing C-specific here. Try some other groups.

--
BR, Vladimir

Mar 9 '06 #2

Thanks for your response.

> I believe this question is better suited for comp.programming or
> similar...

It is posted to comp.programming (and crossposted to comp.lang.c).

> If you at least know the limits of what can be expected, why don't you
> come up with a simple(ish) file description language, and pre-pend it
> (or use it as a header).

This seems even more difficult than the ideas I discussed. Maybe I did
not explain the requirements well. The program has to cope with a
variety of different file formats. Hence the need to make the program
flexible. The file format would be specified in a database or
configuration file and would be fixed for any particular instance of
the program. However there will be many such programs running on
different installations, all reading different file formats.

> Still, nothing C-specific here. Try some other groups.

It's got to be written in C. I think that is specific :-)

M

Mar 9 '06 #3

Vladimir S. Oka

NB: Posted just to comp.lang.c

M wrote:

> Thanks for your response.
>
>> I believe this question is better suited for comp.programming or
>> similar...
>
> It is posted to comp.programming (and crossposted to comp.lang.c)

Sorry, I did not see this.

>> If you at least know the limits of what can be expected, why don't you
>> come up with a simple(ish) file description language, and pre-pend it
>> (or use it as a header).
>
> This seems even more difficult than the ideas I discussed. Maybe I did
> not explain the requirements well. The program has to cope with a variety
> of different file formats. Hence the need to make the program flexible.
> The file format would be specified in a database or configuration file and
> would be fixed for any particular instance of the program. However there will
> be many such programs running on different installations all reading different
> file formats.

You suggested regular expressions. I suggested a simplified form (in
different words), specific to your implementation. Where the
description is stored is really immaterial.

>> Still, nothing C-specific here. Try some other groups.
>
> It's got to be written in C. I think that is specific :-)

You're really after the method, which can be implemented in any
language.

This group (c.l.c) discusses the C language only. Once you implement
this in C (or start implementing it), and have a question about
/implementation/ using standard C, this is the place to ask about it.
(Although, as you will have noticed, we do tend to give it a stab,
while pointing to the better place to ask. ;-) )

--
BR, Vladimir

Mar 9 '06 #4

Richard Heathfield

M said:

> Hi,
>
> I need to parse text files to extract data records. The files will
> consist of a header, zero or more data records, and a trailer. I can
> discard the header and trailer but I must split the data records up
> and return them to an application.
>
> The complexity here is that I won't know the exact format of the files
> until run time.

Been there, done that, got the tee-shirt in several different shapes and
sizes. We ended up writing a data language. (Well, I say we, but I had very
little to do with it actually.) I'm fairly sure I've described it here
before. A descriptor file (text, of course) was used to identify which
fields were present in which locations and how wide they were, that sort of
thing.

> The files may or may not contain headers and trailers
> and the format is not known yet.

You just said they would have a header and a trailer. The exact format may
be a moveable feast, but you need to establish a consistent meta-format
early on.

> I would be grateful for some thoughts on how this could be achieved.

Let's say you wanted to write a C interpreter. (Analogy alert!) To process a
struct definition, you'd have to read it in from the text file, identify
the type of each member, and its name, and (if it's an array) its size. And
you'd have to have some way of finding or updating a particular member's
value, given its name.

You have much the same deal here. Your record is like a C struct, in a way.
(But not in another way. For reading and processing, you will almost
certainly want to be able to access the various fields of a record in a
loop - at least sometimes.) So that gives you a clue about your
configuration file structure. Say, for example, that you are dealing with
orders for nuts and bolts from fifteen different large customers, all of
whom send their orders to you electronically. You might want to have a
config file structure something like this:

FILETYPE Orders
CUSTOMER NutsNBoltsRUs
DEF RECORD Header
    CHAR Type
    DATE Created
    INTEGER RecordCount
ENDDEF
DEF RECORD Bolts
    CHAR Type
    DATE OrderDate
    CHAR 16 ProductCode
    STRING Description *
    INTEGER Height
    INTEGER TopDiameter
    CHAR 3 DontCareA
    INTEGER TipDiameter
    CHAR 3 DontCareB
    INTEGER PitchCode
    CHAR 6 DontCareC
    INTEGER PriceCode
ENDDEF
DEF RECORD Nuts
    CHAR Type
    DATE OrderDate
    CHAR 14 ProductCode
    STRING Description *
    INTEGER MatCode
    INTEGER Depth
    INTEGER ExternalDiameter
    INTEGER InternalDiameter
    INTEGER PitchCode
    INTEGER PriceCode
    CHAR 12 DontCareD
    INTEGER ColourCode
ENDDEF

As you can see, this is easily extensible, and its purpose is to describe
the file format supplied by a particular customer. Thus, its layout will
vary depending on that format. The above example contains some fields that
we simply aren't interested in, but we have to know enough about them to be
able to ignore them - hence the "DontCare" entries. And at runtime, you
simply read the config file to find out where in a record the relevant
field information was. You'll end up with functions to read a record, work
out what record type it is, find a field within a given record either by
name or by index, etc etc. Nothing terribly hard, but needs careful
planning.
--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: rjh at above domain (but drop the www, obviously)

Mar 9 '06 #5
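The descriptor-file idea from the post above could be sketched in C roughly as follows. This is only an illustration under stated assumptions: the struct and function names (field_def, record_def, parse_field_line, parse_record_def, find_field) are hypothetical, the line grammar "TYPE [WIDTH] Name [*]" is inferred from the example config, and the meaning of the trailing "*" is not given in the thread, so it is merely recorded as a flag.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_FIELDS 32
#define MAX_NAME   32

/* One field of a record layout, as read from a descriptor line. */
struct field_def {
    char type[MAX_NAME];  /* e.g. "CHAR", "DATE", "INTEGER", "STRING" */
    char name[MAX_NAME];  /* e.g. "ProductCode" */
    int  width;           /* explicit width, or 0 if the line gave none */
    int  starred;         /* 1 if the line ended in "*" (meaning assumed unknown) */
};

/* One "DEF RECORD ... ENDDEF" block. */
struct record_def {
    char name[MAX_NAME];
    struct field_def fields[MAX_FIELDS];
    int  nfields;
};

/* Parse one field line such as "CHAR 16 ProductCode" or
 * "STRING Description *". Returns 0 on success, -1 if malformed. */
int parse_field_line(const char *line, struct field_def *out)
{
    char a[MAX_NAME], b[MAX_NAME], c[MAX_NAME];
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%31s %31s %31s", a, b, c);
    if (n < 2)
        return -1;
    memset(out, 0, sizeof *out);
    strcpy(out->type, a);               /* safe: tokens are at most 31 chars */
    if (isdigit((unsigned char)b[0])) { /* form: TYPE WIDTH NAME */
        if (n < 3)
            return -1;
        out->width = atoi(b);
        strcpy(out->name, c);
    } else {                            /* form: TYPE NAME [*] */
        strcpy(out->name, b);
        if (n == 3) {
            if (strcmp(c, "*") != 0)
                return -1;
            out->starred = 1;
        }
    }
    return 0;
}

/* Parse lines[0..nlines-1]; lines[0] must be "DEF RECORD <name>" and an
 * "ENDDEF" line must follow. Returns the number of lines consumed, or -1. */
int parse_record_def(const char **lines, int nlines, struct record_def *rec)
{
    int i;

    memset(rec, 0, sizeof *rec);
    if (nlines < 1 || sscanf(lines[0], "DEF RECORD %31s", rec->name) != 1)
        return -1;
    for (i = 1; i < nlines; i++) {
        if (strncmp(lines[i], "ENDDEF", 6) == 0)
            return i + 1;
        if (rec->nfields == MAX_FIELDS ||
            parse_field_line(lines[i], &rec->fields[rec->nfields]) != 0)
            return -1;
        rec->nfields++;
    }
    return -1; /* ran out of lines before ENDDEF */
}

/* Look a field up by name. Returns its index, or -1 if not present. */
int find_field(const struct record_def *rec, const char *name)
{
    int i;

    for (i = 0; i < rec->nfields; i++)
        if (strcmp(rec->fields[i].name, name) == 0)
            return i;
    return -1;
}
```

Keeping the fields in an array means the application can reach a field either by name (find_field) or by index in a loop, which are the two access paths the post mentions.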

Programming Master

It is impossible to use regex w/o knowing the file formats.

If you can provide further information on what you want to do with your
program, I will try to provide some further assistance.

Mar 10 '06 #6

Oliver Wong

"Programming Master" <ei*********@gmail.com> wrote in message
news:11**********************@e56g2000cwe.googlegroups.com...

> It is impossible to use regex w/o knowing the file formats.
>
> If you can provide further information on what you want to do with your
> program, I will try to provide some further assistance.

I think the OP is saying the program WILL know the file formats...
except only at runtime, instead of at compile time.

Mar 10 '06 #7
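To illustrate the point that the format only has to be known at run time: POSIX regcomp() compiles a pattern from an ordinary string, so the string can come from a configuration file or database. The sketch below assumes a POSIX <regex.h> is available on the target (worth checking on OpenVMS); the function names compile_runtime_pattern and matches_runtime_pattern and the sample patterns are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Compile a pattern that was read at run time (extended syntax).
 * Returns 0 on success. On failure returns -1 and, if errbuf is given,
 * fills it with the library's description of the problem. */
int compile_runtime_pattern(regex_t *re, const char *pattern,
                            char *errbuf, size_t errlen)
{
    int rc;

    assert(re != NULL && pattern != NULL);
    rc = regcomp(re, pattern, REG_EXTENDED);
    if (rc != 0) {
        if (errbuf != NULL && errlen > 0)
            regerror(rc, re, errbuf, errlen);
        return -1;
    }
    return 0;
}

/* Returns 1 if text contains a match for pattern, 0 if it does not,
 * and -1 if the pattern itself is invalid. */
int matches_runtime_pattern(const char *pattern, const char *text)
{
    regex_t re;
    char err[128];
    int hit;

    if (compile_runtime_pattern(&re, pattern, err, sizeof err) != 0)
        return -1;
    hit = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return hit;
}
```

Validating each configured pattern once at start-up, and reporting the regerror() text, keeps a bad configuration entry from surfacing later as a silent failure to match.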

> I think the OP is saying the program WILL know the file formats...

> except only at runtime, instead of at compile time.

Correct. The program will have to cope with many different file
formats (conforming to the specification from my original post). The
exact format will be known at run time and may be specified in terms of
regular expressions.

The purpose of this application is to interpret data files from many
different clients. Each client uses a slightly different file format.
My program has to be able to read all the files.

I have now completed a prototype, based on the provision of five
different regular expressions to define a file format. It would be
nice to reduce the number of expressions necessary, but I can't see a
way of doing this. This is really what the original post was about:
using a single RE.

Mark

Mar 13 '06 #8
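On the single-RE question: with POSIX regexec(), a parenthesised group that sits inside a repetition reports only the offsets of its last iteration, so one expression covering the whole file cannot hand back every record separately. A common alternative is to apply a smaller expression repeatedly. The sketch below is not the prototype described above; it is a minimal illustration that splits an in-memory buffer on a separator expression, with header and trailer handling left out. The names split_records and struct span are hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* A record located inside the input buffer. */
struct span {
    size_t start;  /* offset of the record's first byte */
    size_t len;    /* length of the record in bytes */
};

/* Split text at every match of sep_pattern, storing at most max spans.
 * Adjacent separators produce zero-length records.
 * Returns the number of records stored, or -1 on an invalid pattern. */
int split_records(const char *text, const char *sep_pattern,
                  struct span *out, int max)
{
    regex_t re;
    regmatch_t m;
    size_t pos = 0;
    int n = 0;
    int eflags = 0;

    assert(text != NULL && out != NULL);
    if (regcomp(&re, sep_pattern, REG_EXTENDED) != 0)
        return -1;

    while (n < max && regexec(&re, text + pos, 1, &m, eflags) == 0) {
        if (m.rm_eo == 0)
            break;  /* empty match at the current position: no progress */
        out[n].start = pos;
        out[n].len = (size_t)m.rm_so;  /* text before the separator */
        n++;
        pos += (size_t)m.rm_eo;        /* resume after the separator */
        eflags = REG_NOTBOL;           /* no longer at start of buffer */
    }

    /* Anything after the last separator is the final record. */
    if (n < max && text[pos] != '\0') {
        out[n].start = pos;
        out[n].len = strlen(text + pos);
        n++;
    }
    regfree(&re);
    return n;
}
```

The same loop shape works with begin/end marker expressions instead of a separator: find the next begin match, then the next end match after it, and emit the bytes in between.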
