473,411 Members | 2,289 Online
Bytes | Software Development & Data Engineering Community
Post Job

Home Posts Topics Members FAQ

Join Bytes to post your question to a community of 473,411 software developers and data experts.

text file parsing (awk -> python)

Hi list,

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?
Nov 22 '06 #1
3 6244
Daniel Nogradi wrote:
I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?
data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
from cStringIO import StringIO
return StringIO(data)

converters = dict(
x=int,
y=int
)

def name_value(line):
name, value = line.split(None, 1)
return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
from itertools import groupby
records = []

for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
records.append(dict(name_value(line) for line in record))

import pprint
pprint.pprint(records)
Nov 22 '06 #2
I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3', 'y':1 }

But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).

What would be the simples way to do this?

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
from cStringIO import StringIO
return StringIO(data)

converters = dict(
x=int,
y=int
)

def name_value(line):
name, value = line.split(None, 1)
return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
from itertools import groupby
records = []

for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
records.append(dict(name_value(line) for line in record))

import pprint
pprint.pprint(records)

Thanks very much, that's exactly what I had in mind.

Thanks again,
Daniel
Nov 22 '06 #3
Peter Otten, your solution is very nice, it uses groupby splitting on
empty lines, so it doesn't need to read the whole files into memory.

But Daniel Nogradi says:
But the names of the fields (node, x, y) keeps changing from file to
file, even their number is not fixed, sometimes it is (node, x, y, z).
Your version with the converters dict fails to convert the number of
node, z fields, etc. (generally using such converters dict is an
elegant solution, it allows to define string, float, etc fields):
converters = dict(
x=int,
y=int
)

I have created a version with a RE, but it's probably too much rigid,
it doesn't handle files with the z field, etc:

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
block = obj.groups()
d = dict((block[i], int(block[i+1])) for i in xrange(0, 6, 2))
result.append(d)

print result
So I have just modified and simplified your quite nice solution (I have
removed the pprint, but it's the same):

def open(filename):
from cStringIO import StringIO
return StringIO(data)

from itertools import groupby

records = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
if not empty:
pairs = ([k, int(v)] for k,v in map(str.split, record))
records.append(dict(pairs))

print records

Bye,
bearophile

Nov 22 '06 #4

This thread has been closed and replies have been disabled. Please start a new discussion.

Similar topics

6
by: Ratnakar Pedagani | last post by:
Hi, I'm trying to parse the text file, which is of size more than 2mb. I'm using the following sample code Open "c:\sim1.txt" For Input As #1 Do While Not EOF(1) Input #1, Data If...
2
by: Dan Jacobson | last post by:
An old dog can't learn new tricks, so where's the a2py awk to python translator? Perl has a2p. E.g. today I wonder how to do '{print $1}', well with a2p I know how to do it in perl, but with...
2
by: Russell Klopfer | last post by:
Hello. I would like to know how I can parse a plain-text file. All I want to do is be able to sequentially extract each word from a document. Similar to the StringTokenizer in Java. Is there a...
8
by: Imran | last post by:
hello all, I have to parse a text file and get some value in that. text file content is as follows. ####TEXT FILE CONTENT STARTS HERE ##### /start first 0x1234 AC /end
4
by: Carsten Kraft | last post by:
Hello Newsgroup, I think this is easy for you: I want to save the data line by line into an string array. eg. Text file: Array Line 1 Line1
10
by: ghazanfar | last post by:
hi, i have text file of the form atom_trace('emotion_response_level(a1, 1.56072)', emotion_response_level(a1, 1.56072), ). and atom_trace('goto(a1, a3)', goto(a1, a3), ).
3
by: vinodmalraj | last post by:
Am new to perl language , would really help if some of you assist me how to use a regex say for example this is my log file 000046571|1000025|CUSTOMER|27-JUN-2007...
2
by: python | last post by:
I'm parsing a text file for a proprietary product that has the following 2 directives: #include <somefile> #define <name<value> Defined constants are referenced via <#name#syntax. I'm...
4
watertraveller
by: watertraveller | last post by:
Hi. I'm new to batch files, and relatively new to the Windows command line in general. I'm making a batch file for the Windows XP command line. I want to examine, for each line of a text file,...
0
by: emmanuelkatto | last post by:
Hi All, I am Emmanuel katto from Uganda. I want to ask what challenges you've faced while migrating a website to cloud. Please let me know. Thanks! Emmanuel
0
BarryA
by: BarryA | last post by:
What are the essential steps and strategies outlined in the Data Structures and Algorithms (DSA) roadmap for aspiring data scientists? How can individuals effectively utilize this roadmap to progress...
1
by: Sonnysonu | last post by:
This is the data of csv file 1 2 3 1 2 3 1 2 3 1 2 3 2 3 2 3 3 the lengths should be different i have to store the data by column-wise with in the specific length. suppose the i have to...
0
by: Hystou | last post by:
There are some requirements for setting up RAID: 1. The motherboard and BIOS support RAID configuration. 2. The motherboard has 2 or more available SATA protocol SSD/HDD slots (including MSATA, M.2...
0
marktang
by: marktang | last post by:
ONU (Optical Network Unit) is one of the key components for providing high-speed Internet services. Its primary function is to act as an endpoint device located at the user's premises. However,...
0
Oralloy
by: Oralloy | last post by:
Hello folks, I am unable to find appropriate documentation on the type promotion of bit-fields when using the generalised comparison operator "<=>". The problem is that using the GNU compilers,...
0
by: Hystou | last post by:
Overview: Windows 11 and 10 have less user interface control over operating system update behaviour than previous versions of Windows. In Windows 11 and 10, there is no way to turn off the Windows...
0
agi2029
by: agi2029 | last post by:
Let's talk about the concept of autonomous AI software engineers and no-code agents. These AIs are designed to manage the entire lifecycle of a software development project—planning, coding, testing,...
0
isladogs
by: isladogs | last post by:
The next Access Europe User Group meeting will be on Wednesday 1 May 2024 starting at 18:00 UK time (6PM UTC+1) and finishing by 19:30 (7.30PM). In this session, we are pleased to welcome a new...

By using Bytes.com and it's services, you agree to our Privacy Policy and Terms of Use.

To disable or enable advertisements and analytics tracking please visit the manage ads & tracking page.