
text file parsing (awk -> python)

Hi list,

I have an awk program that parses a text file which I would like to
rewrite in python. The text file has multi-line records separated by
empty lines and each single-line field has two subfields:

node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1

and this I would like to parse into a list of dictionaries like so:

mydict[0] = { 'node':10, 'x':-1, 'y':1 }
mydict[1] = { 'node':11, 'x':-2, 'y':1 }
mydict[2] = { 'node':12, 'x':-3, 'y':1 }

But the names of the fields (node, x, y) keep changing from file to
file, and even their number is not fixed; sometimes it is (node, x, y, z).

What would be the simplest way to do this?
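
Roughly, I imagine collecting name/value pairs until a blank line closes
the record; here is an untested plain-loop sketch of that idea
('records.txt' is just a placeholder name, and the values are left as
strings), but I suspect there is a cleaner, more idiomatic way:

records = []
current = {}
for line in open('records.txt'):
    if line.strip():
        # every non-blank line is "name value"; keep whatever names appear
        name, value = line.split(None, 1)
        current[name] = value.strip()
    elif current:
        # a blank line closes the current record
        records.append(current)
        current = {}
if current:
    # the file may not end with a blank line
    records.append(current)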
Nov 22 '06 #1
3 Replies


Daniel Nogradi wrote:
> I have an awk program that parses a text file which I would like to
> rewrite in python. The text file has multi-line records separated by
> empty lines and each single-line field has two subfields [...]
> What would be the simplest way to do this?

data = """node 10
x -1
y 1

node 11
x -2
y 1

node 12
x -3
y 1
"""

def open(filename):
    # test shim: shadow the builtin open() and serve the sample data
    # from memory; remove this def to read a real records.txt instead
    from cStringIO import StringIO
    return StringIO(data)

converters = dict(
    x=int,
    y=int
)

def name_value(line):
    # split "name value"; convert the value if a converter is registered,
    # otherwise just strip the trailing newline
    name, value = line.split(None, 1)
    return name, converters.get(name, str.rstrip)(value)

if __name__ == "__main__":
    from itertools import groupby
    records = []

    for empty, record in groupby(open("records.txt"), key=str.isspace):
        # each run of consecutive non-blank lines is one record
        if not empty:
            records.append(dict(name_value(line) for line in record))

    import pprint
    pprint.pprint(records)
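
The work is done by groupby(): with key=str.isspace it collapses the file
into alternating runs of blank and non-blank lines, and every non-blank
run is exactly one record. A toy illustration of just that grouping step
(the lines below are made up for the example):

from itertools import groupby

lines = ["node 10\n", "x -1\n", "\n", "node 11\n", "x -2\n"]
groups = [(blank, list(run)) for blank, run in groupby(lines, key=str.isspace)]
# groups == [(False, ['node 10\n', 'x -1\n']),
#            (True,  ['\n']),
#            (False, ['node 11\n', 'x -2\n'])]

For the sample data above, x and y should come out as ints while node
stays a string such as '10', since no converter is registered for it.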
Nov 22 '06 #2

Thanks very much, that's exactly what I had in mind.

Thanks again,
Daniel
Nov 22 '06 #3

Peter Otten, your solution is very nice, it uses groupby splitting on
empty lines, so it doesn't need to read the whole files into memory.

But Daniel Nogradi says:
> But the names of the fields (node, x, y) keep changing from file to
> file, and even their number is not fixed; sometimes it is (node, x, y, z).

Your version with the converters dict fails to convert the values of the
node and z fields, etc. (in general such a converters dict is an elegant
solution, since it lets you define string, float, etc. fields):

converters = dict(
    x=int,
    y=int
)

I have created a version with a RE, but it's probably too rigid: it
assumes exactly three fields per record, so it doesn't handle files
where the number of fields varies (like the extra z field below):

data = """node 10
y 1
x -1

node 11
x -2
y 1
z 5

node 12
x -3
y 1
z 6"""

import re
unpack = re.compile(r"(\D+) \s+ ([-+]? \d+) \s+" * 3, re.VERBOSE)

result = []
for obj in unpack.finditer(data):
    # groups() is a flat (name, value, name, value, name, value) tuple
    block = obj.groups()
    d = dict((block[i], int(block[i+1])) for i in xrange(0, 6, 2))
    result.append(d)

print result
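
A less rigid RE variant could split on the empty lines first and then
pick up however many name/value pairs each record happens to contain;
just a sketch, checked only against the sample data above (pair, blocks
and result2 are names I made up):

import re

# one regex for a single "name integer" pair, applied per block
pair = re.compile(r"(\S+)\s+([-+]?\d+)")
blocks = [b for b in re.split(r"\n\s*\n", data) if b.strip()]
result2 = [dict((name, int(num)) for name, num in pair.findall(block))
           for block in blocks]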
So I have just modified and simplified your quite nice solution (I have
removed the pprint, but otherwise it's the same):

def open(filename):
    # same test shim as above: serve the sample data from memory
    from cStringIO import StringIO
    return StringIO(data)

from itertools import groupby

records = []
for empty, record in groupby(open("records.txt"), key=str.isspace):
    if not empty:
        # every field value goes through int()
        pairs = ([k, int(v)] for k, v in map(str.split, record))
        records.append(dict(pairs))

print records
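
The one assumption this makes is that every value is an integer; if a
file ever carries a non-numeric field, int(v) raises ValueError. A
guarded converter could fall back to the raw string (maybe_int is just a
made-up name for the sketch):

def maybe_int(text):
    # keep the raw string when the field isn't an integer
    try:
        return int(text)
    except ValueError:
        return text

and in the loop above, int(v) would then become maybe_int(v).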

Bye,
bearophile

Nov 22 '06 #4
