splitting tables

robsom

Hi, I have a problem with a small python program I'm trying to write
and I hope somebody may help me. I'm working on tables of this kind:

CGA 1988 06 21 13 48 G500-050 D 509.62 J.. R1 1993 01 28 00 00 880006
CGA 1988 06 21 14 04 G500-051 D 550.62 J.. R1 1993 01 28 00 00 880007

I have to read each line of the table and put it into comma-separated
lists like these for later manipulation:

CGA,1988,06,21,13,48,G500-050,D,509.62,J..,R1,1993,01,28,00,00,880006
CGA,1988,06,21,14,04,G500-051,D,550.62,J..,R1,1993,01,28,00,00,880007

The 'split' function works pretty well, except when there is an error in
the original data table. For example if an element is missin in a line,
like this:

CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

This error happens quite often in my dataset and the tables are too
large to check for it manually. In this case what I get splitting the
line string is of course this:

CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22,D,508.06,J..,R1,1993,01,27,00 ,00,900065

And when the program tries to work on the second list it stops (of course!).
Is there any way to avoid this problem? This kind of error happens quite
often in my dataset and the tables are usually too large to check for it
manually. Thanks a lot for any suggestions.

R

Jul 18 '05 #1

Subscribe Reply

2046

Skip Montanaro

r> The 'split' function works pretty well, except when there is an error in
r> the original data table. For example if an element is missin in a line,
r> like this:

r> CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
r> CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

If the data are truly fixed width, just slice the strings:

s 'CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065' s[0:3], s[4:8], s[9:11], s[12:14], s[15:17], s[18:20], s[21:29], s[30:31]

('CGA', '1990', '08', '16', '01', '22', ' ', 'D')

Skip

Jul 18 '05 #2

Karl Pflästerer

On 7 Feb 2004, robsom <- no*****@no.mail.it wrote:

And when the program tries to work on the second list it stops (of course!).
Is there any way to avoid this problem? This kind of error happens quite

What do you want to be done? To see if an item is missing is trivial:
just check the length of the splitted line (a list). But what the right
action in that case is belongs to you; should the user be asked? is
always the same column missing? is it possible to distinguish the
entries without errors from each other so the programm can decide which
column is missing?

KP

--
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe;
All mimsy were the borogoves,
And the mome raths outgrabe. "Lewis Carroll" "Jabberwocky"

Jul 18 '05 #3

Bengt Richter

On Sat, 7 Feb 2004 20:08:50 +0000 (UTC), robsom <no*****@no.mail.it> wrote:

Hi, I have a problem with a small python program I'm trying to write
and I hope somebody may help me. I'm working on tables of this kind:

CGA 1988 06 21 13 48 G500-050 D 509.62 J.. R1 1993 01 28 00 00 880006
CGA 1988 06 21 14 04 G500-051 D 550.62 J.. R1 1993 01 28 00 00 880007

I have to read each line of the table and put it into comma-separated
lists like these for later manipulation:

CGA,1988,06,21,13,48,G500-050,D,509.62,J..,R1,1993,01,28,00,00,880006
CGA,1988,06,21,14,04,G500-051,D,550.62,J..,R1,1993,01,28,00,00,880007

The 'split' function works pretty well, except when there is an error in
the original data table. For example if an element is missin in a line,
like this:

CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

This error happens quite often in my dataset and the tables are too
large to check for it manually. In this case what I get splitting the
line string is of course this:

CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22,D,508.06,J..,R1,1993,01,27,0 0,00,900065

And when the program tries to work on the second list it stops (of course!).
Is there any way to avoid this problem? This kind of error happens quite
often in my dataset and the tables are usually too large to check for it
manually. Thanks a lot for any suggestions.
s = """\ ... CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
... CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065
... """ import re
rxo = re.compile( ... '(...) (....) (..) (..) (..) (..) (........) (.) '
... '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
... ) import csv
import sys
writer = csv.writer(sys.stdout)
for line in s.splitlines(): writer.writerow(*rxo.findall(line))

...
CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22, ,D,508.06,J..,R1,1993,01,27,00,00,900065

To write the csv lines to a file instead of sys.stdout, substitute (untested)
file('your_csv_output_file.csv') in place of sys.stdout in the above, and get your
lines from something like (note chopping off the trailing newline)

for line in file('your_table_file'):
line = line.rstrip('\n')

instead of

for line in s.splitlines()

If you have possible short lines that create no match, you'll need to check for those
before unpacking (by using the prefixed *) into writer.writerow's arg list.

That's it for clp today ;-)

Regards,
Bengt Richter

Jul 18 '05 #4

robsom

Il Sat, 07 Feb 2004 22:10:17 +0100, Karl Pflästerer ha scritto:

What do you want to be done? To see if an item is missing is trivial:
just check the length of the splitted line (a list). But what the right
action in that case is belongs to you; should the user be asked? is
always the same column missing? is it possible to distinguish the
entries without errors from each other so the programm can decide which
column is missing?

Ok, I'll try to give some more detail. These are files with data from
field measurements and contain information about location, time,
measurement, measurement flag, error, detection limit, calibration and
other stuff like that. The problem is that they are made by different
groups and are not always consistent in their format and I'm trying to
write a code which is as general as possible.
When a table has fixed-width columns and each element is correctly aligned
as in the example I showed you, the problem is solved by slicing the
string as Skip Montanaro suggested in his answer (thanks!), but this is
not always the case. For example I can have something like this:

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456

where -9999 (or somethinbg similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.
Any suggestions will be much appreciated, thanks

R

Jul 18 '05 #5

Diez B. Roggisch

> where -9999 (or somethinbg similar) indicates there is a blank, one space

divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.
Any suggestions will be much appreciated, thanks

The data looks like the different fields have different types - e.g. floats,
years, fixed width integers and so on - so you might be able to create a
regular expression like this:

"(CGA|) ([0-9]{4}|) ([0-9][2]|)"

You then could retrieve the single groups of the match. If a line doesn't
match, you either adapt the expression, or if thats not possible, create a
error-file that you could look over manually.
--
Regards,

Diez B. Roggisch

Jul 18 '05 #6

Bengt Richter

On Sun, 8 Feb 2004 14:41:55 +0000 (UTC), robsom <no*****@no.mail.it> wrote:

Il Sat, 07 Feb 2004 22:10:17 +0100, Karl Pflästerer ha scritto:
What do you want to be done? To see if an item is missing is trivial:
just check the length of the splitted line (a list). But what the right
action in that case is belongs to you; should the user be asked? is
always the same column missing? is it possible to distinguish the
entries without errors from each other so the programm can decide which
column is missing?

Ok, I'll try to give some more detail. These are files with data from
field measurements and contain information about location, time,
measurement, measurement flag, error, detection limit, calibration and
other stuff like that. The problem is that they are made by different
groups and are not always consistent in their format and I'm trying to
write a code which is as general as possible.
When a table has fixed-width columns and each element is correctly aligned
as in the example I showed you, the problem is solved by slicing the
string as Skip Montanaro suggested in his answer (thanks!), but this is
not always the case. For example I can have something like this:

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456

where -9999 (or somethinbg similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.
Any suggestions will be much appreciated, thanks

Maybe a modified regex that takes into account particular field formats?
A regex will search for things in order, so you can set one up to match
special things like -9999 while still allowing -9999.9 etc.
What do you know about each field and the separations? Is there always a full
set of fields, even if some are blank?

(BTW, my other post misleads in implying that line.rstrip('\n') is necessary
to get the regex to match).

Regards,
Bengt Richter

Jul 18 '05 #7

robsom

Il Sat, 07 Feb 2004 21:56:23 +0000, Bengt Richter ha scritto:

>>> s = """\ ... CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
... CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065
... """ >>> import re
>>> rxo = re.compile( ... '(...) (....) (..) (..) (..) (..) (........) (.) '
... '(......) (...) (..) (....) (..) (..) (..) (..) (......)'
... ) >>> import csv
>>> import sys
>>> writer = csv.writer(sys.stdout)
>>> for line in s.splitlines():Apparently my system doesn't have a
>>> writer.writerow(*rxo.findall(line))

...
CGA,1990,08,15,13,16,G500-105,D,524.45,J..,R1,1993,01,29,00,00,900069
CGA,1990,08,16,01,22, ,D,508.06,J..,R1,1993,01,27,00,00,900065

To write the csv lines to a file instead of sys.stdout, substitute
(untested) file('your_csv_output_file.csv') in place of sys.stdout in
the above, and get your lines from something like (note chopping off the
trailing newline)

for line in file('your_table_file'):
line = line.rstrip('\n')

instead of

for line in s.splitlines()

If you have possible short lines that create no match, you'll need to
check for those before unpacking (by using the prefixed *) into
writer.writerow's arg list.

I'm not sure I completely understand what you are suggesting. Use a sort
of "mask" to extract information from the line and put it into a string?
And the use some function of the csv module (which I don't have) to
manipulate this string and output it?
If I got it correct this would work only if all the elements of the
columns have the same number of characters which is not always the case as
I was saying in my reply to Karl Pflästerer above.
Any other idea? Thanks

R

Jul 18 '05 #8

Karl Pflästerer

On 8 Feb 2004, robsom <- no*****@no.mail.it wrote:

47.455677 456.67
47.4558 453.8
47.46789 -9999
47.4567 456 where -9999 (or somethinbg similar) indicates there is a blank, one space
divides the columns and the elements can have a different number of
digits. This is of course a worst-case scenario :)
That is why I used split in the beginning, but then I fall into the other
problem, when there is a missing value.

I still can't see the problem sorry; if the columns are always separated
by space you can split them with:

s = StringIO.StringIO("""47.455677 456.67 .... 47.4558 453.8
.... 47.46789 -9999
.... 47.4567 456""") for line in s: .... print ','.join(map(lambda dat: dat != '-9999' and dat or '', line.split()))
....
47.455677,456.67
47.4558,453.8
47.46789,
47.4567,456 ##or for line in s: .... print ','.join(map(lambda dat: dat != '-9999' and dat or 'MISSING', line.split()))
....
47.455677,456.67
47.4558,453.8
47.46789,MISSING
47.4567,456

I used here your -9999 as value of a mssing item. Before you map that
function across the splitted line you culd check for the correct
length to see if you have e.g. like here always two coulmns. If the
length differs it depends if it's possible to find programmatically the
missing column and just insert a space or something other.

KP

--
`Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!' "Lewis Carroll" "Jabberwocky"

Jul 18 '05 #9

Scott David Daniels

robsom wrote:

... The 'split' function works pretty well, except when there is an error
in the original data table. For example if an element is missin in a line,
like this:

CGA 1990 08 15 13 16 G500-105 D 524.45 J.. R1 1993 01 29 00 00 900069
CGA 1990 08 16 01 22 D 508.06 J.. R1 1993 01 27 00 00 900065

A way to stay reasonably fast is to use split when it works:

lengths = 3,4,2,2,2,8,1,8,3,2,4,2,2,2,2,6
...
result = line.split()
if len(result) != len(lengths):
result, pos = [], 0
for length in lengths:
next = pos + length + 1
result.append(line[pos : next].strip())
pos = next
...
--
-Scott David Daniels
Sc***********@Acm.Org

Jul 18 '05 #10

robsom

Il Sat, 07 Feb 2004 13:04:05 -0800, Scott David Daniels ha scritto:

A way to stay reasonably fast is to use split when it works:

lengths = 3,4,2,2,2,8,1,8,3,2,4,2,2,2,2,6
...
result = line.split()
if len(result) != len(lengths):
result, pos = [], 0
for length in lengths:
next = pos + length + 1
result.append(line[pos : next].strip())
pos = next
...

I think I understand what you mean, but what data type is lenghts?
is it a list or a tuple? In this case there should be the parenthesis,
shouldn't they?
Sorry, I'm really making newbie silly questions :), but I couldn't find
that in my reference manual.
In any case this would work only when all the elements in a column have
the same length, which, as I said, is not always the case. thanks a lot

R

Jul 18 '05 #11

robsom

Sorry guys, I thank all of you, but I'm a beginner and my knowledge of
Python is not yet enough to follow you.

for line in s:
...print ','.join(map(lambda dat: dat != '-9999' and dat or '', line.split()))
...

As far as I understand you are using a 'map' function to apply the
function "lambda dat: dat != '-9999' and dat or ''" to the elements
obtained with "line.split()".
First question: shouldn't it be split(line)?
Now, my book says that lambda works by applying (in this example) the
instructions in "dat != '-9999' and dat or ''" to the dat variable. Does
this mean that (I'm more or less guessing now) this code splits the line
assign each element to a dat variable which could have three types of
value (i.e. its "normal" value or -9999 or a space)? And then you use a
join functions to put again together the elements yolu have separated with
map with a comma as separator? And here I have a second question:
shouldn't it be "join(map(....),',')" instead of "','.join(map(...))"?
I used here your -9999 as value of a mssing item. Before you map that
function across the splitted line you culd check for the correct length
to see if you have e.g. like here always two coulmns. If the length
differs it depends if it's possible to find programmatically the missing
column and just insert a space or something other.

Ok, but before going on, I'd like to know if I understand the code.
Probably I got it all wrong, but this is quite a few chapters beyond where
I am now in my book :)
Morevoer I wonder if it is possible to achieve the same thing without
using the functional programming.

many thanks again

R

Jul 18 '05 #12

Karl Pflästerer

On 10 Feb 2004, robsom <- no*****@no.mail.it wrote:

Sorry guys, I thank all of you, but I'm a beginner and my knowledge of
Python is not yet enough to follow you.
If you're interested there's a vrey good mailing list for Python
starters (and others): Python Tutor (I don't know the correct address
ath the momwnr but it should be easy to find on the ython web site).

> for line in s:

...print ','.join(map(lambda dat: dat != '-9999' and dat or '',
line.split())) ...

As far as I understand you are using a 'map' function to apply the
function "lambda dat: dat != '-9999' and dat or ''" to the elements
obtained with "line.split()".
Right.
First question: shouldn't it be split(line)?
No. The split method is a method of strings. Since line is a string
you call here a method of a string object.
Now, my book says that lambda works by applying (in this example) the
instructions in "dat != '-9999' and dat or ''" to the dat variable. Does
this mean that (I'm more or less guessing now) this code splits the line
It gets splitted on spaces.
assign each element to a dat variable which could have three types of
value (i.e. its "normal" value or -9999 or a space)? And then you use a
That's nearly right. This trick with `and' and `or' is necessary since
you can't have statements in Python lambda. Written the above as `if'
statement would be:
if dat != '9999':
return dat
else:
return ''
join functions to put again together the elements yolu have separated with
`join' is here again (like `split') a method of the string object.
map with a comma as separator? And here I have a second question:
shouldn't it be "join(map(....),',')" instead of "','.join(map(...))"?
No since join is a method. If you have an older Python book it may be
written differently.

[...] Ok, but before going on, I'd like to know if I understand the code.
Probably I got it all wrong, but this is quite a few chapters beyond where
I am now in my book :)
I didn't know you were a beginner otherwise I would have written more
verbosely. The Tutor list might be the right place for you to start.
Morevoer I wonder if it is possible to achieve the same thing without
using the functional programming.

Yes, but it would be less fun :-)

KP

--
You know you've been sitting in front of your Lisp machine too long
when you go out to the junk food machine and start wondering how to
make it give you the CADR of Item H so you can get that yummie
chocolate cupcake that's stuck behind the disgusting vanilla one.

Jul 18 '05 #13

robsom

Il Tue, 10 Feb 2004 18:09:33 +0100, Karl Pflästerer ha scritto:

If you're interested there's a vrey good mailing list for Python
starters (and others): Python Tutor (I don't know the correct address
ath the momwnr but it should be easy to find on the ython web site).
I'll have a look, I'm beginning to appreciate Python even though at first
I hated this indentation thing and I (still) don't like the "for in
range()" way to build loops.

First question: shouldn't it be split(line)?

No. The split method is a method of strings. Since line is a string
you call here a method of a string object.

Ok, but then why if I write this

for line in fin.readlines():
line = split(line)

it does work anyway? Are there two ways to use the same function?

Now, my book says that lambda works by applying (in this example) the
instructions in "dat != '-9999' and dat or ''" to the dat variable. Does
this mean that (I'm more or less guessing now) this code splits the line

It gets splitted on spaces.
assign each element to a dat variable which could have three types of
value (i.e. its "normal" value or -9999 or a space)? And then you use a

That's nearly right. This trick with `and' and `or' is necessary since
you can't have statements in Python lambda. Written the above as `if'
statement would be:
if dat != '9999':
return dat
else:
return ''

So in the end here is ho this piece of code works:
takes a line, split the line on spaces and put each element into a dat
variable. if the dat variable is equal to '9999' or whatever other symbol
is used to indicate a blank substitute it with a space otherwise leave its
value. then put together the dat (into a list?) separated by a comma.
I got it! maybe...
No since join is a method. If you have an older Python book it may be
written differently.
well I'm using "Learning to Program" by Alan Gauld which I found on the
python website. But I found the description of the join function on the
python documentation of my system. Anyway, as I said it works both ways
(at least for split, haven't tried with join).
I was thinking of buying "Learning Python" of the O'Reilly series. What do
you think?
I didn't know you were a beginner otherwise I would have written more
verbosely. The Tutor list might be the right place for you to start.

don't worry I should have said :)

Morevoer I wonder if it is possible to achieve the same thing without
using the functional programming.

Yes, but it would be less fun :-)

uhm... maybe... I'll tell you when I get there... :D:D
thanks a lot

R

Jul 18 '05 #14

Ben Finney

On Wed, 11 Feb 2004 00:44:18 +0000 (UTC), robsom wrote:

I'm beginning to appreciate Python even though at first I hated this
indentation thing
Yup. Just because something seems limiting doesn't mean it is.
and I (still) don't like the "for in range()" way to build loops.
The "for index in range( size )" is mostly recommended as a way to loop
over a list; range( size ) will elegantly generate a sequence of indices
into a list of size "size".

What is it you don't like?

Il Tue, 10 Feb 2004 18:09:33 +0100, Karl Pflästerer ha scritto:
No. The split method is a method of strings. Since line is a string
you call here a method of a string object.
Ok, but then why if I write this

for line in fin.readlines():
line = split(line)

it does work anyway?

Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).

Are there two ways to use the same function?

They're two different functions. One is a method of string objects, one
is a function you've imported from somewhere. (This confusion is partly
why importing symbols into the base namespace is a bad idea.)

--
\ "I think a good gift for the President would be a chocolate |
`\ revolver. And since he's so busy, you'd probably have to run up |
_o__) to him real quick and hand it to him." -- Jack Handey |
Ben Finney <http://bignose.squidly.org/>

Jul 18 '05 #15

Skip Montanaro

robsom> ... I (still) don't like the "for in range()" way to build
robsom> loops.

That's needed only when you need a numeric index. Much of the time you can
just iterate over a container object:

for element in mylist:
print element

or

for key in mydict:
print (key, mydict[key])

Skip

Jul 18 '05 #16

robsom

On Wed, 11 Feb 2004 00:44:18 +0000 (UTC), robsom wrote:
I'm beginning to appreciate Python even though at first I hated this
indentation thing
Yup. Just because something seems limiting doesn't mean it is.

Right, I'm just saying that I was used to indent when I wanted and what I
wanted and that I missed the "closing instruction"
But, to be fair, I've adapted much faster than I thought. Just a matter of
getting used to it.
The "for index in range( size )" is mostly recommended as a way to loop
over a list; range( size ) will elegantly generate a sequence of indices
into a list of size "size".

What is it you don't like?
well, I just think that a kind of sintax like: for i=x to y
is much more intuitive and easy to use. It caused me a lot of errors,
particularly in the beginning because I wasn't getting the values I
wanted for the index. Again is probably just a matter of getting used to
it.
Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).
They're two different functions. One is a method of string objects, one
is a function you've imported from somewhere. (This confusion is partly
why importing symbols into the base namespace is a bad idea.)

in fact there is an import string * instruction at the beginning.
ok, so you suggest using the method because it results in a faster code?
thanks

R

Jul 18 '05 #17

Ben Finney

On Thu, 12 Feb 2004 00:56:22 +0000 (UTC), robsom wrote:

On Wed, 11 Feb 2004 00:44:18 +0000 (UTC), robsom wrote:
[why do str.split() and split(str) both work?]

Because somewhere earlier in that code you must have imported the name
'split' into the base namespace. It's not there to begin with (and
importing it is needlessly polluting the base namespace).

in fact there is an import string * instruction at the beginning.

Yes, the "from module import *" is a deprecated usage; it pollutes the
base namespace, which leads to confusions like this and others.

The recommended way is to:

import module

module.function()

So, in your case, this would be:

import string

string.split( str )

Then it becomes clear in each instance that the "split" function comes
from the "string" module, and not some other arbitrary module.

[are they the same function?]

They're two different functions. One is a method of string objects,
one is a function you've imported from somewhere. (This confusion is
partly why importing symbols into the base namespace is a bad idea.)

ok, so you suggest using the method because it results in a faster
code?

No, I recommend using the method because it's already part of the string
objects, and results in clearer code.

--
\ "It ain't so much the things we don't know that get us in |
`\ trouble. It's the things we know that ain't so." -- Artemus |
_o__) Ward (1834-67), U.S. journalist |
Ben Finney <http://bignose.squidly.org/>

Jul 18 '05 #18

John J. Lee

Skip Montanaro <sk**@pobox.com> writes:

robsom> ... I (still) don't like the "for in range()" way to build
robsom> loops.

That's needed only when you need a numeric index. Much of the time you can
just iterate over a container object:

for element in mylist:
print element

or

for key in mydict:
print (key, mydict[key])

also (Python 2.3):

for index, element in enumerate(mylist):
# blah, blah
John

Jul 18 '05 #19

splitting tables

Similar topics