Bytes IT Community

Best way to parse file into db-type layout?

I've got a file that seems to come across more like a dictionary from what I can
tell. Something like the following format:

###,1,val_1,2,val_2,3,val_3,5,val_5,10,val_10
###,1,val_1,2,val_2,3,val_3,5,val_5,11,val_11,25,val_25,967,val_967

In other words, different layouts (defined mostly by what is in val_1, val_2,
val_3).

The ,#, fields indicate what "field" from our mainframe the corresponding value
represents.

Is there a good way to parse this into a DB-type format where I only pull out
the values corresponding to the appropriate field numbers? Seems like
converting to a dictionary of some sort would be best, but I don't quite know
how I would go about that.

In this case, the first field is a value unto itself - represents a "letter
type" that would be associated with the rest of the record. The fields are
either present or not, no placeholder if there's no value for e.g. Field #4.

Thanks for any help or pointers you can give me.

-Pete
Jul 19 '05 #1
19 Replies


On Thu, 28 Apr 2005 23:34:31 GMT, Peter A. Schott
<pa******@no.yahoo.spamm.com> wrote:
I've got a file that seems to come across more like a dictionary from what I can
tell. Something like the following format:

###,1,val_1,2,val_2,3,val_3,5,val_5,10,val_10
###,1,val_1,2,val_2,3,val_3,5,val_5,11,val_11,25,val_25,967,val_967

In other words, different layouts (defined mostly by what is in val_1, val_2,
val_3).

The ,#, fields indicate what "field" from our mainframe the corresponding value
represents.

Is there a good way to parse this into a DB-type format where I only pull out
the values corresponding to the appropriate field numbers? Seems like
converting to a dictionary of some sort would be best, but I don't quite know
how I would go about that.

In this case, the first field is a value unto itself - represents a "letter
type" that would be associated with the rest of the record. The fields are
either present or not, no placeholder if there's no value for e.g. Field #4.


Here's a sketch, tested as you'll see, but totally devoid of the
error-checking that would be necessary for any data coming from an MF.

C:\junk>type schott.txt
pers,1,xxx,2,yyy,3,zzz,100,SMITH,101,JOHN,102,ALOYSIUS,103,1969-12-31
addr,1,qqq,2,www,3,eee,200,"""THE LODGE"", 123 MAIN ST",205,WALLA WALLA,206,WA

C:\junk>type schott.py
import csv
for row in csv.reader(open('schott.txt', 'rb')):
    rectype = row[0]
    recdict = {}
    for k in range(1, len(row), 2):
        recdict[int(row[k])] = row[k+1]
    print rectype, recdict

C:\junk>python schott.py
pers {1: 'xxx', 2: 'yyy', 3: 'zzz', 100: 'SMITH', 101: 'JOHN', 102: 'ALOYSIUS', 103: '1969-12-31'}
addr {1: 'qqq', 2: 'www', 3: 'eee', 200: '"THE LODGE", 123 MAIN ST', 205: 'WALLA WALLA', 206: 'WA'}

Hint: you'll probably go nuts if you don't implement some sort of
naming convention instead of those numbers.

One way would be like this:

mf_surname = 100
mf_given_1 = 101
...
mf_state = 206

then you can refer to recdict[mf_state] instead of recdict[206].

Going upmarket a bit:

Have a mf_map = {100: 'surname', 206: 'state', } # etc etc

then you do

class Record(object):
    pass

# for each row:
rec = Record()
rec.rectype = row[0]
for k in range(1, len(row), 2):
    setattr(rec, mf_map[int(row[k])], row[k+1])

Then you can refer to rec.state instead of recdict[mf_state] or
recdict[206].

Further upmarket would involve gathering basic "type" information
about the MF fields (free text, alpha code, identifier (e.g. SSN),
money, quantity, date, etc etc) so that you can do validations and
format conversions as appropriate.
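A minimal sketch of that typed-field-map idea (the field numbers, names, and converters below are assumptions for illustration, not the real mainframe layout):

```python
import datetime

# Hypothetical map: mainframe field number -> (name, converter).
MF_FIELDS = {
    100: ('surname', str),
    101: ('given_1', str),
    103: ('birth_date',
          lambda s: datetime.datetime.strptime(s, '%Y-%m-%d').date()),
    206: ('state', str),
}

def convert_row(row):
    # Turn ['pers', '100', 'SMITH', ...] into a dict of typed values.
    rec = {'rectype': row[0]}
    for k in range(1, len(row), 2):
        name, conv = MF_FIELDS[int(row[k])]
        rec[name] = conv(row[k + 1])
    return rec

rec = convert_row(['pers', '100', 'SMITH', '103', '1969-12-31'])
# rec['birth_date'] is now a real date object, not a string
```

An unknown field number raises KeyError here, which is one place to hang the validation mentioned above.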

HTH,
John

Jul 19 '05 #2

Peter A. Schott wrote:
I've got a file that seems to come across more like a dictionary from what I can
tell. Something like the following format:

###,1,val_1,2,val_2,3,val_3,5,val_5,10,val_10
###,1,val_1,2,val_2,3,val_3,5,val_5,11,val_11,25,val_25,967,val_967


Peter, I'm not sure exactly what you want. Perhaps a dictionary for each
row in the file? Where the first row would result in:

{"letter_type": "###", 1: "val_1", 2: "val_2", 3: "val_3", 5: "val_5",
10: "val_10"}

Something like this:

import csv
import fileinput

row_dicts = []
for row in csv.reader(fileinput.input()):
    row_dict = dict(letter_type=row[0])

    for col_index in xrange(1, len(row), 2):
        row_dict[int(row[col_index])] = row[col_index+1]

    row_dicts.append(row_dict)

Someone else might come up with something more elegant.
--
Michael Hoffman
Jul 19 '05 #3

On Fri, 29 Apr 2005 01:44:30 +0100, Michael Hoffman
<ca*******@mh391.invalid> wrote:
for row in csv.reader(fileinput.input()):


csv.reader requires that if the first arg is a file that it be opened
in binary mode.
Jul 19 '05 #4

That looks promising. The field numbers are pre-defined at the mainframe level.
This may help me get to my ultimate goal which is to pump these into a DB on a
row-by-row basis ( :-P ) I'll have to do some playing around with this. I
knew that it looked like a dictionary, but wasn't sure how best to handle this.

One follow-up question: I'll end up getting multiple records for each "type".
Would I be referencing these by row[#][field#]?

A minor revision to the format is that it starts like:
###,1,1,val_1,...
I think right now the plan is to parse through the file and insert the pairs
directly into a DB table. Something like RowID, LetterType, date, Field#,
Value. I can get RowID and LetterType overall, date is a constant, the rest
would involve reading each pair and inserting both values into the table. Time
to hit the books a little more to get up to speed on all of this.
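A sketch of that RowID/LetterType/date/Field#/Value plan, using an in-memory SQLite table (the table and column names are assumptions, not a real schema):

```python
import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("""CREATE TABLE letter_fields
                (row_id INTEGER, letter_type TEXT, load_date TEXT,
                 field_num INTEGER, value TEXT)""")

LOAD_DATE = '2005-04-29'  # "date is a constant" per the post

def load_lines(lines):
    # One inserted row per field-number/value pair.
    for row_id, row in enumerate(csv.reader(lines), 1):
        letter_type = row[0]
        for k in range(1, len(row), 2):
            conn.execute(
                'INSERT INTO letter_fields VALUES (?, ?, ?, ?, ?)',
                (row_id, letter_type, LOAD_DATE, int(row[k]), row[k + 1]))

load_lines(['AAA,1,val_1,2,val_2', 'BBB,1,val_1,5,val_5'])
```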

I really appreciate the help. I knew there had to be a better way to do this,
just wasn't sure what it was.

-Pete
John Machin <sj******@lexicon.net> wrote:
[...]

Jul 19 '05 #5

On Fri, 29 Apr 2005 18:54:54 GMT, Peter A. Schott
<pa******@no.yahoo.spamm.com> wrote:
That looks promising. The field numbers are pre-defined at the mainframe level.
Of course. Have you managed to acquire a copy of the documentation, or
do you have to reverse-engineer it?
This may help me get to my ultimate goal which is to pump these into a DB on a
row-by-row basis ( :-P )
That's your *ultimate* goal? Are you running a retro-computing museum
or something? Don't you want to *USE* the data?
I'll have to do some playing around with this. I
knew that it looked like a dictionary, but wasn't sure how best to handle this.

One follow-up question: I'll end up getting multiple records for each "type".
What does that mean?? If it means that more than one customer will get
the "please settle your account" letter, and more than one customer
will get the "please buy a spangled fritzolator, only $9.99" letter,
you are stating the obvious -- otherwise, please explain.
Would I be referencing these by row[#][field#]?
Not too sure what you mean by that -- whether you can get away with a
(read a row, write a row) way of handling the data depends on its
structure (like what are the relationships if any between different
rows) and what you want to do with it -- both murky concepts at the
moment.

A minor revision to the format is that it starts like:
###,1,1,val_1,...
How often do these "minor revisions" happen? How flexible do you have
to be? And the extra "1" means what? Is it ever any other number?


I think right now the plan is to parse through the file and insert the pairs
directly into a DB table. Something like RowID, LetterType, date, Field#,
Value.
Again, I'd recommend you lose the "Field#" in favour of a better
representation, ASAP.
I can get RowID and LetterType overall, date is a constant, the rest
would involve reading each pair and inserting both values into the table. Time
to hit the books a little more to get up to speed on all of this.


What you need is (a) a clear appreciation of what you are trying to do
with the data at a high level (b) then develop an understanding of
what is the underlying data model (c) then and only then worry about
technical details.

Good luck,
John
Jul 19 '05 #6

John Machin wrote:
[Michael Hoffman]:
for row in csv.reader(fileinput.input()):


csv.reader requires that if the first arg is a file that it be opened
in binary mode.


fileinput.input() is not a file.

I have tested this code and it works fine for the provided example.
--
Michael Hoffman
Jul 19 '05 #7

On Fri, 29 Apr 2005 23:21:43 +0100, Michael Hoffman
<ca*******@mh391.invalid> wrote:
John Machin wrote:
[Michael Hoffman]:
for row in csv.reader(fileinput.input()):
csv.reader requires that if the first arg is a file that it be opened
in binary mode.


fileinput.input() is not a file.


Hair-splitter. fileinput opens its files in text mode.

It's an awk simulation and shouldn't be used for real-world data.

I have tested this code and it works fine for the provided example.


Well I've got news for you: real-world data has embedded CRs, LFs and
(worst of all) ^Zs often enough, and you won't find them mentioned in
any documentation, nor find them in examples.
Jul 19 '05 #8

John Machin wrote:
[Michael Hoffman]:
John Machin wrote:
[Michael Hoffman]:

for row in csv.reader(fileinput.input()):

csv.reader requires that if the first arg is a file that it be opened
in binary mode.
fileinput.input() is not a file.


Hair-splitter.


Is name-calling really necessary?
It's an awk simulation and shouldn't be used for real-world data.


I don't see why not, so long as your data is text.
I have tested this code and it works fine for the provided example.


Well I've got news for you: real-world data has embedded CRs, LFs and
(worst of all) ^Zs often enough, and you won't find them mentioned in
any documentation, nor find them in examples.


That's nice. Well I agree with you, if the OP is concerned about embedded
CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
then he *definitely* shouldn't use fileinput.

And otherwise, there's really no reason not to.
--
Michael Hoffman
Jul 19 '05 #9

On Sat, 30 Apr 2005 00:40:50 +0100, Michael Hoffman
<ca*******@mh391.invalid> wrote:
John Machin wrote:
[Michael Hoffman]:
John Machin wrote:
[Michael Hoffman]:

>for row in csv.reader(fileinput.input()):

csv.reader requires that if the first arg is a file that it be opened
in binary mode.

fileinput.input() is not a file.
Hair-splitter.


Is name-calling really necessary?


I beg your pardon. How does: "Your point addresses the letter rather
than the spirit of the 'law'" sound?
It's an awk simulation and shouldn't be used for real-world data.
I don't see why not, so long as your data is text.


Real-world data is not "text".
I have tested this code and it works fine for the provided example.
Well I've got news for you: real-world data has embedded CRs, LFs and
(worst of all) ^Zs often enough, and you won't find them mentioned in
any documentation, nor find them in examples.


That's nice. Well I agree with you, if the OP is concerned about embedded
CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
then he *definitely* shouldn't use fileinput.


And if the OP is naive enough not to be concerned, then it's OK, is
it?

And otherwise, there's really no reason not to.


Except, perhaps, the reason stated in fileinput.py itself:

"""
Performance: this module is unfortunately one of the slower ways of
processing large numbers of input lines.
"""
Jul 19 '05 #10

John Machin wrote:
I beg your pardon. How does: "Your point addresses the letter rather
than the spirit of the 'law'" sound?
Sure, thanks.
Real-world data is not "text".
A lot of real-world data is. For example, almost all of the data I deal with
is text.
That's nice. Well I agree with you, if the OP is concerned about embedded
CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
then he *definitely* shouldn't use fileinput.


And if the OP is naive enough not to be concerned, then it's OK, is
it?


It simply isn't a problem in some real-world problem domains. And if there
are control characters the OP didn't expect in the input, and csv loads it
without complaint, I would say that he is likely to have other problems once
he's processing it.
Except, perhaps, the reason stated in fileinput.py itself:

"""
Performance: this module is unfortunately one of the slower ways of
processing large numbers of input lines.
"""


Fair enough, although Python is full of useful things that save the
programmer's time at the expense of that of the CPU, and this is
frequently considered a Good Thing.

Let me ask you this, are you simply opposed to something like fileinput
in principle or is it only because of (1) no binary mode, and (2) poor
performance? Because those are both things that could be fixed. I think
fileinput is so useful that I'm willing to spend some time working on it
when I have some.
--
Michael Hoffman
Jul 19 '05 #11

On Sat, 30 Apr 2005 11:35:05 +0100, Michael Hoffman
<ca*******@mh391.invalid> wrote:
John Machin wrote:
Real-world data is not "text".


A lot of real-world data is. For example, almost all of the data I deal with
is text.


OK, depends on one's definitions of "data" and "text". In the domain
of commercial database applications, there is what's loosely called
"text": entity names, and addresses, and product descriptions, and the
dreaded free-text "note" columns -- all of which (not just the
"notes") one can end up parsing trying to extract extraneous data
that's been dumped in there ... sigh ...
That's nice. Well I agree with you, if the OP is concerned about embedded
CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
then he *definitely* shouldn't use fileinput.
And if the OP is naive enough not to be concerned, then it's OK, is
it?


It simply isn't a problem in some real-world problem domains. And if there
are control characters the OP didn't expect in the input, and csv loads it
without complaint, I would say that he is likely to have other problems once
he's processing it.


Presuming for the moment that the reason for csv not complaining is
that the data meets the csv non-spec and that the csv module is
checking that: then at least he's got his data in the structural
format he's expecting; if he doesn't do any/enough validation on the
data, we can't save him from that.
Except, perhaps, the reason stated in fileinput.py itself:

"""
Performance: this module is unfortunately one of the slower ways of
processing large numbers of input lines.
"""


Fair enough, although Python is full of useful things that save the
programmer's time at the expense of that of the CPU, and this is
frequently considered a Good Thing.

Let me ask you this, are you simply opposed to something like fileinput
in principle or is it only because of (1) no binary mode, and (2) poor
performance? Because those are both things that could be fixed. I think
fileinput is so useful that I'm willing to spend some time working on it
when I have some.


I wouldn't use fileinput for a "commercial data processing" exercise,
because it's slow, and (if it involved using the Python csv module) it
opens the files in text mode, and because in such exercises I don't
often need to process multiple files as though they were one file.

When I am interested in multiple files -- more likely a script that
scans source files -- even though I wouldn't care about the speed nor
the binary mode, I usually do something like:

for pattern in args: # args from an optparse parser
    for filename in glob.glob(pattern):
        for line in open(filename):

There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.

Cheers,
John

Jul 19 '05 #12

John Machin wrote:
[...]

I wouldn't use fileinput for a "commercial data processing" exercise,
because it's slow, and (if it involved using the Python csv module) it
opens the files in text mode, and because in such exercises I don't
often need to process multiple files as though they were one file.
If the process runs once a month, and takes ten minutes to process the
required data, isn't that fast enough? It's unwise to act as though
"slow" is an absolute term.
When I am interested in multiple files -- more likely a script that
scans source files -- even though I wouldn't care about the speed nor
the binary mode, I usually do something like:

for pattern in args: # args from an optparse parser
    for filename in glob.glob(pattern):
        for line in open(filename):

There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.

But if it happens to be convenient for the task at hand why deny the OP
the use of a tool that can solve a problem? We shouldn't be so purist
that we create extra (and unnecessary) work :-), and principles should
be tempered with pragmatism in the real world.

regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/

Jul 19 '05 #13

John Machin wrote:
That's nice. Well I agree with you, if the OP is concerned about embedded
CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
then he *definitely* shouldn't use fileinput.

And if the OP is naive enough not to be concerned, then it's OK, is
it?
It simply isn't a problem in some real-world problem domains. And if there
are control characters the OP didn't expect in the input, and csv loads it
without complaint, I would say that he is likely to have other problems once
he's processing it.


Presuming for the moment that the reason for csv not complaining is
that the data meets the csv non-spec and that the csv module is
checking that: then at least he's got his data in the structural
format he's expecting; if he doesn't do any/enough validation on the
data, we can't save him from that.


What if the input is UTF-16? Your solution won't work for that. And there
are certainly UTF-16 CSV files out in the wild.

I think at some point you have to decide that certain kinds of data
are not sensible input to your program, and that the extra hassle in
programming around them is not worth the benefit.
There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.


Yes, indeed. I never use those, and would probably do something akin to what
you are suggesting rather than doing so. I simply enjoy the no-hassle
simplicity of fileinput.input() rather than worrying about whether my data
will be piped in, or in file(s) specified on the command line.
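That no-hassle style might be sketched like this (the `count_lines` helper is hypothetical); with no arguments, `fileinput.input()` falls back to standard input:

```python
import fileinput

def count_lines(files=None):
    # files=None means: read stdin, or the files named on the command line;
    # passing a list of filenames reads those files as one stream.
    total = 0
    for line in fileinput.input(files=files):
        total += 1
    return total
```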
--
Michael Hoffman
Jul 19 '05 #14

On Sat, 30 Apr 2005 09:23:16 -0400, Steve Holden <st***@holdenweb.com>
declaimed the following in comp.lang.python:
If the process runs once a month, and takes ten minutes to process the
required data, isn't that fast enough? It's unwise to act as though
"slow" is an absolute term.
Now you're getting into the (very) loose definition of
"real-time" I used in college: Responding to input fast enough to not
have an impact on the /next/ input... (which makes a weekend
payroll/check printing run "real-time" as far as payroll accounting is
concerned).

--
wl*****@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG
wu******@dm.net       | Bestiaria Support Staff
Home Page: <http://www.dm.net/~wulfraed/>
Overflow Page: <http://wlfraed.home.netcom.com/>

Jul 19 '05 #15

On Sat, 30 Apr 2005 09:23:16 -0400, Steve Holden <st***@holdenweb.com>
wrote:
John Machin wrote:
[...]

I wouldn't use fileinput for a "commercial data processing" exercise,
because it's slow, and (if it involved using the Python csv module) it
opens the files in text mode, and because in such exercises I don't
often need to process multiple files as though they were one file.
If the process runs once a month, and takes ten minutes to process the
required data, isn't that fast enough?


Depends: (1) criticality: could it have been made to run in 5 minutes,
avoiding the accountant missing the deadline to EFT the taxes to the
government (or, worse, missing the last train home)?

(2) "Many a mickle makes a muckle": the total of all run times could
be such that overnight processing doesn't complete before the day
shift turns up ...
It's unwise to act as though
"slow" is an absolute term.
When I am interested in multiple files -- more likely a script that
scans source files -- even though I wouldn't care about the speed nor
the binary mode, I usually do something like:

for pattern in args: # args from an optparse parser
    for filename in glob.glob(pattern):
        for line in open(filename):

There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.

But if it happens to be convenient for the task at hand why deny the OP
the use of a tool that can solve a problem? We shouldn't be so purist
that we create extra (and unnecessary) work :-), and principles should
be tempered with pragmatism in the real world.


If the job at hand is simulating awk's file reading habits, yes then
fileinput is convenient. However if the job at hand involves anything
like real-world commercial data processing requirements then fileinput
is NOT convenient.

Example 1: Requirement is, for each input file, to display name of
file, number of records, and some data totals.

Example 2: Requirement is, if end of file occurs when not expected
(including, but not restricted to, the case of zero records) display
an error message and terminate abnormally.

I'd like to see some code for example 1 that used fileinput (on a list
of filenames) and didn't involve "extra (and unnecessary) work"
compared to the "for filename in alist / f = open(filename) / for line
in f" way of doing it.
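For comparison, Example 1 in that plain nested-loop style might look like the sketch below; the input layout (CSV with a numeric amount in the last field) is assumed for illustration:

```python
import csv

def summarise(filenames):
    # For each file: (name, record count, total of the last column).
    results = []
    for filename in filenames:
        count = 0
        total = 0.0
        f = open(filename)  # 'rb' under Python 2 for the csv module
        for row in csv.reader(f):
            count += 1
            total += float(row[-1])
        f.close()
        results.append((filename, count, total))
    return results
```

Per-file counts and totals fall out naturally because each file gets its own loop; no end-of-file polling is needed.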

If fileinput didn't exist, what do you think the reaction would be if
you raised a PEP to include it in the core?

Jul 19 '05 #16

On Sat, 30 Apr 2005 14:31:08 +0100, Michael Hoffman
<ca*******@mh391.invalid> wrote:
John Machin wrote:
>That's nice. Well I agree with you, if the OP is concerned about embedded
>CRs, LFs and ^Zs in his data (and he is using Windows in the latter case),
>then he *definitely* shouldn't use fileinput.

And if the OP is naive enough not to be concerned, then it's OK, is
it?

It simply isn't a problem in some real-world problem domains. And if there
are control characters the OP didn't expect in the input, and csv loads it
without complaint, I would say that he is likely to have other problems once
he's processing it.
Presuming for the moment that the reason for csv not complaining is
that the data meets the csv non-spec and that the csv module is
checking that: then at least he's got his data in the structural
format he's expecting; if he doesn't do any/enough validation on the
data, we can't save him from that.


What if the input is UTF-16? Your solution won't work for that. And there
are certainly UTF-16 CSV files out in the wild.


The csv module docs do say that Unicode is not supported.

This does appear to work, however, at least for data that could in
fact be encoded as ASCII:
import codecs
import csv
j = codecs.open('utf16junk.txt', 'rb', 'utf-16')
rdr = csv.reader(j, delimiter='\t')
rows = list(rdr)


The usual trick to smuggle righteous data past the heathen (recode as
UTF-8, cross the border, decode) should work. However the OP's data is
coming from an MF, not from Excel "save as Unicode text" (which
produces a tab-delimited .txt file -- how do you get a UTF-16 CSV
file?) and if it's not in ASCII it may have a bit more chance of being
in EBCDIC than UTF-16 -- unless MFs have come a long way since I last
had anything to do with them :-)

In any case, my "solution" was a sketch, and stated to be such. We
don't know, and I suspect the OP doesn't know, exactly (1) what
encoding is being used (2) what the rules are about quoting the
delimiter, and quoting the quote character. It's very possible even if
it's encoded in ASCII and the delimiter is a comma that the quoting
system being used is not the expected Excel-like method but something
else and hence the csv module can't be used.

I think at some point you have to decide that certain kinds of data
are not sensible input to your program, and that the extra hassle in
programming around them is not worth the benefit.
I prefer to decide at a very early point what is sensible input to a
program, and then try to ensure that nonsensible input neither goes
unnoticed nor crashes with an unhelpful message.
There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.


Yes, indeed. I never use those, and would probably do something akin to what
you are suggesting rather than doing so. I simply enjoy the no-hassle
simplicity of fileinput.input() rather than worrying about whether my data
will be piped in, or in file(s) specified on the command line.


Good, now we're singing from the same hymnbook :-)

Jul 19 '05 #17

John Machin wrote:
On Sat, 30 Apr 2005 09:23:16 -0400, Steve Holden <st***@holdenweb.com>
wrote:

John Machin wrote:
[...]
I wouldn't use fileinput for a "commercial data processing" exercise,
because it's slow, and (if it involved using the Python csv module) it
opens the files in text mode, and because in such exercises I don't
often need to process multiple files as though they were one file.

If the process runs once a month, and takes ten minutes to process the
required data, isn't that fast enough?

Depends: (1) criticality: could it have been made to run in 5 minutes,
avoiding the accountant missing the deadline to EFT the taxes to the
government (or, worse, missing the last train home)?

Get real: if that's the timeline, you don't need new software, you
need a new accountant.
(2) "Many a mickle makes a muckle": the total of all run times could
be such that overnight processing doesn't complete before the day
shift turns up ...
Again, get real and stop nitpicking.
It's unwise to act as though
"slow" is an absolute term.
When I am interested in multiple files -- more likely a script that
scans source files -- even though I wouldn't care about the speed nor
the binary mode, I usually do something like:

for pattern in args: # args from an optparse parser
    for filename in glob.glob(pattern):
        for line in open(filename):

There is also an "on principle" element to it as well -- with
fileinput one has to use the awkish methods like filelineno() and
nextfile(); strikes me as a tricksy and inverted way of doing things.


But if it happens to be convenient for the task at hand why deny the OP
the use of a tool that can solve a problem? We shouldn't be so purist
that we create extra (and unnecessary) work :-), and principles should
be tempered with pragmatism in the real world.

If the job at hand is simulating awk's file reading habits, yes then
fileinput is convenient. However if the job at hand involves anything
like real-world commercial data processing requirements then fileinput
is NOT convenient.

Yet again, get real. If someone tells me that fileinput meets their
requirements who am I (not to mention who are *you*) to say they should
invest extra effort in solving their problem some other way?
Example 1: Requirement is, for each input file, to display name of
file, number of records, and some data totals.

Example 2: Requirement is, if end of file occurs when not expected
(including, but not restricted to, the case of zero records) display
an error message and terminate abnormally.
Possibly these examples would have some force if they weren't simply
invented.
I'd like to see some code for example 1 that used fileinput (on a list
of filenames) and didn't involve "extra (and unnecessary) work"
compared to the "for filename in alist / f = open(filename) / for line
in f" way of doing it.

If fileinput didn't exist, what do you think the reaction would be if
you raised a PEP to include it in the core?

Why should such speculation interest me?

regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/

Jul 19 '05 #18

John Machin wrote:
[Michael Hoffman]:
What if the input is UTF-16? Your solution won't work for that. And there
are certainly UTF-16 CSV files out in the wild.
The csv module docs do say that Unicode is not supported.

This does appear to work, however, at least for data that could in
fact be encoded as ASCII:


And for data that can't be expressed as ASCII? It doesn't work.

So throw out csv, just like fileinput. After all, despite its utility,
and the fact that you obviously suspect the OP will never have to deal
with UTF-16 (otherwise you would have suggested this without prompting),
it won't work for *every* conceivable case.
The usual trick to smuggle righteous data past the heathen (recode as
UTF-8, cross the border, decode) should work.


True, but that's a lot of trouble to go to for something that you expect
will never happen, and for a script that may only be run by the
programmer who can certainly deal with the exceptions when they happen.

The range of sensible input is something to be determined by a
specification, or the programmer if no spec exists. Not by kibitzers[1]
speaking on high from c.l.p. <wink>
--
Michael Hoffman

[1] Yes, I include myself in that category.
Jul 19 '05 #19

On Sat, 30 Apr 2005 23:11:48 -0400, Steve Holden <st***@holdenweb.com>
wrote:
John Machin wrote:
If the job at hand is simulating awk's file reading habits, yes then
fileinput is convenient. However if the job at hand involves anything
like real-world commercial data processing requirements then fileinput
is NOT convenient.

Yet again, get real. If someone tells me that fileinput meets their
requirements who am I (not to mention who are *you*) to say they should
invest extra effort in solving their problem some other way?


Michael Hoffman has said that it meets his simple requirements. He
doesn't use its filelineno() and nextfile(), and says he wouldn't use
it if he needed that sort of functionality. I have no argument with
that.

If I genuinely thought that fileinput (or any other piece of software)
would not meet somebody's requirements i.e. would not solve their
problem, then what should I do? Unless bound by blood ties or
contractual obligations should I keep silent? In any case, who are
*you* to suggest I shouldn't express an opinion?
Back to fileinput: it complicates things when you want to do something
less simple, like some action at the end of each file -- you have to
poll for the first line of the next file, which means that the
end-of-each-file code has to be repeated at the end of all files.
Further, you don't get to see empty files. Hence the examples:
Example 1: Requirement is, for each input file, to display name of
file, number of records, and some data totals.

Example 2: Requirement is, if end of file occurs when not expected
(including, but not restricted to, the case of zero records) display
an error message and terminate abnormally.

Possibly these examples would have some force if they weren't simply
invented.


The only "invention" was *simplification* of genuine real-world
requirements.

Many entities receive periodically, often daily, remittances from
other entities with whom they do business. In parallel to the
remittance being paid into the recipient's bank account, there is sent
a file containing details of the breakdown of the total money amount.
At the end of the file there is a trailer record which is mandated to
contain the number of detail records and the total amount of money.

Checking the contents of the trailer record against (a) the bank
account and (b) calculated totals from the detail records is a real
requirement. So is ringing the alarm bells if end of file is detected
before the trailer record is detected (or there is any other evidence
that the file is defective). How could you possibly imagine that these
are "simply invented"?
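A sketch of that trailer-record check; the record layout (detail lines `D,<amount>`, trailer `T,<count>,<total>`) is an assumption for illustration:

```python
def check_remittance(lines):
    # Ring the alarm bells if the trailer is missing or doesn't match.
    count = 0
    total = 0
    for line in lines:
        parts = line.strip().split(',')
        if parts[0] == 'T':
            if count != int(parts[1]) or total != int(parts[2]):
                raise ValueError('trailer does not match detail records')
            return count, total
        count += 1
        total += int(parts[1])
    raise ValueError('end of file before trailer record')
```

Hitting end of file without seeing the trailer covers the zero-record case as well as a truncated file.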

Perhaps we should just agree that we have differing perceptions of
reality, and move on.

Cheers,
John
Jul 19 '05 #20

This discussion thread is closed.