
text file parsing (awk -> python)

Hi list,

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3, 'y':1 }

But the names of the fields (node, x, y) keep changing from file to
file, and even their number is not fixed; sometimes it is (node, x, y, z).

What would be the simplest way to do this?
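
Roughly, I imagine collecting name/value pairs until a blank line closes
the record; here is an untested plain-loop sketch of that idea
('records.txt' is just a placeholder name, and the values are left as
strings), but I suspect there is a cleaner, more idiomatic way:

records = []
current = {}
for line in open('records.txt'):
    if line.strip():
        # every non-blank line is "name value"; keep whatever names appear
        name, value = line.split(None, 1)
        current[name] = value.strip()
    elif current:
        # a blank line closes the current record
        records.append(current)
        current = {}
if current:
    # the file may not end with a blank line
    records.append(current)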
Nov 22 '06 #1
3 Replies


Daniel Nogradi wrote:
> I have an awk program that parses a text file which I would like to
> rewrite in python. The text file has multi-line records separated by
> empty lines and each single-line field has two subfields [...]
> What would be the simplest way to do this?

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
    # test shim: shadow the builtin open() and serve the sample data
    # from memory; remove this def to read a real records.txt instead
    from cStringIO import StringIO
    return StringIO(data)

converters = dict(
    x=int,
    y=int
)

def name_value(line):
    # split "name value"; convert the value if a converter is registered,
    # otherwise just strip the trailing newline
    name, value = line.split(None, 1)
    return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
    from itertools import groupby
    records = []

    for empty, record in groupby(open("records.txt"), key=str.isspace):
        # each run of consecutive non-blank lines is one record
        if not empty:
            records.append(dict(name_value(line) for line in record))

    import pprint
    pprint.pprint(records)
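
The work is done by groupby(): with key=str.isspace it collapses the file
into alternating runs of blank and non-blank lines, and every non-blank
run is exactly one record. A toy illustration of just that grouping step
(the lines below are made up for the example):

from itertools import groupby

lines = ["node 10\n", "x -1\n", "\n", "node 11\n", "x -2\n"]
groups = [(blank, list(run)) for blank, run in groupby(lines, key=str.isspace)]
# groups == [(False, ['node 10\n', 'x -1\n']),
#            (True,  ['\n']),
#            (False, ['node 11\n', 'x -2\n'])]

For the sample data above, x and y should come out as ints while node
stays a string such as '10', since no converter is registered for it.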
Nov 22 '06 #2

Thanks very much, that's exactly what I had in mind.

Thanks again,
Daniel
Nov 22 '06 #3

Peter Otten, your solution is very nice, it uses groupby splitting on
empty lines, so it doesn't need to read the whole files into memory.

But Daniel Nogradi says:
> But the names of the fields (node, x, y) keep changing from file to
> file, and even their number is not fixed; sometimes it is (node, x, y, z).

Your version with the converters dict fails to convert the values of the
node and z fields, etc. (in general such a converters dict is an elegant
solution, since it lets you define string, float, etc. fields):

converters = dict(
    x=int,
    y=int
)

I have created a version with a RE, but it's probably too rigid: it
assumes exactly three fields per record, so it doesn't handle files
where the number of fields varies (like the extra z field below):

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
    # groups() is a flat (name, value, name, value, name, value) tuple
    block = obj.groups()
    d = dict((block[i], int(block[i+1])) for i in xrange(0, 6, 2))
    result.append(d)

print result
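
A less rigid RE variant could split on the empty lines first and then
pick up however many name/value pairs each record happens to contain;
just a sketch, checked only against the sample data above (pair, blocks
and result2 are names I made up):

import re

# one regex for a single "name integer" pair, applied per block
pair = re.compile(r"(\S+)\s+([-+]?\d+)")
blocks = [b for b in re.split(r"\n\s*\n", data) if b.strip()]
result2 = [dict((name, int(num)) for name, num in pair.findall(block))
           for block in blocks]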
So I have just modified and simplified your quite nice solution (I have
removed the pprint, but otherwise it's the same):

def open(filename):
    # same test shim as above: serve the sample data from memory
    from cStringIO import StringIO
    return StringIO(data)

from itertools import groupby

records = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
    if not empty:
        # every field value goes through int()
        pairs = ([k, int(v)] for k, v in map(str.split, record))
        records.append(dict(pairs))

print records
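
The one assumption this makes is that every value is an integer; if a
file ever carries a non-numeric field, int(v) raises ValueError. A
guarded converter could fall back to the raw string (maybe_int is just a
made-up name for the sketch):

def maybe_int(text):
    # keep the raw string when the field isn't an integer
    try:
        return int(text)
    except ValueError:
        return text

and in the loop above, int(v) would then become maybe_int(v).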

Bye,
bearophile

Nov 22 '06 #4
