Cleaner idiom for text processing?

Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Thanks!
Mike Ellis
Jul 18 '05 #1
Michael Ellis wrote:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


for line in infile:
    tokens = line.split()
    d = dict(zip(tokens[::2], tokens[1::2]))
    do_something_with_values(...)

By the way, don't use "dict" as a variable name. It's already
a builtin factory function to create dictionaries.
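
A minimal sketch of the slicing idiom above, written for Python 3. The helper name `parse_pairs` and the sample line are illustrative assumptions and do not appear in the thread.

```python
def parse_pairs(line):
    """Turn 'name value name value ...' into a dict (hypothetical helper)."""
    tokens = line.split()
    # Even-indexed tokens are names, odd-indexed tokens are their values.
    return dict(zip(tokens[::2], tokens[1::2]))


d = parse_pairs("foo 1 bar 2 baz 3\n")
print(d["foo"], d["bar"])
```

Using a name such as `d` keeps the builtin `dict` available for the conversion itself.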

-Peter
Jul 18 '05 #2
I'd move the logic that turns the file into the form you want to
process into its own generator, under the assumption that you'll use
this code from multiple places.
def process_tokens(f):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield d
Then,
for d in process_tokens(infile):
    do_something_with_values(d['foo'], d['bar'])

If the specific keys you want from each line are constant for the loop,
have the generator yield those items in sequence:
def process_tokens2(f, keys):
    for line in f:
        tokens = line.split()
        d = {}
        for i in range(0, len(tokens), 2): d[tokens[i]] = tokens[i+1]
        yield [d[k] for k in keys]

for foo, bar in process_tokens2(infile, ["foo", "bar"]):
    do_something_with_values(foo, bar)
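
A self-contained sketch along the lines of the generator approach above, assuming Python 3 and any file-like object that yields lines. The name `iter_values` and the in-memory sample data are hypothetical stand-ins for the real input file.

```python
import io


def iter_values(f, keys):
    """Yield the values for `keys`, in order, from each line of `f`."""
    for line in f:
        tokens = line.split()
        d = dict(zip(tokens[::2], tokens[1::2]))
        # A missing key raises KeyError, which surfaces malformed lines early.
        yield [d[k] for k in keys]


# io.StringIO stands in for an open data file (an assumption for the demo).
sample = io.StringIO("foo 1 bar 2 baz 3\nbar 20 foo 10\n")
rows = list(iter_values(sample, ["foo", "bar"]))
for foo, bar in rows:
    print(foo, bar)
```

Because the keys are looked up by name, the order of the pairs within a line does not matter.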

Jeff

Jul 18 '05 #3
Michael Ellis wrote:
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


Yet another way to create the dictionary:
>>> import itertools
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(itertools.izip(nv, nv))
{'baz': '3', 'foo': '1', 'bar': '2'}


Peter

Jul 18 '05 #4
Peter Otten <__*******@web.de> wrote in news:c9*************@news.t-online.com:
Yet another way to create the dictionary:
>>> import itertools
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(itertools.izip(nv, nv))
{'baz': '3', 'foo': '1', 'bar': '2'}

You can also do that without using itertools:
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(zip(nv,nv))
{'baz': '3', 'foo': '1', 'bar': '2'}


However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?

Jul 18 '05 #5
Duncan Booth wrote:
Peter Otten <__*******@web.de> wrote in news:c9*************@news.t-online.com:
Yet another way to create the dictionary:
>>> import itertools
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(itertools.izip(nv, nv))
{'baz': '3', 'foo': '1', 'bar': '2'}

You can also do that without using itertools:
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(zip(nv,nv))
{'baz': '3', 'foo': '1', 'bar': '2'}

The advantage of my solution is that it omits the intermediate list.
However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?


I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be sure
whether the previous consumer just read 10 items ahead for efficiency
reasons. Allowing such optimizations would in effect limit iterators to for
loops. Moreover, the calling function has no way of knowing whether that
would really be efficient as the first iterator might take a looong time to
yield the next value while the second could just throw a StopIteration. If
a way around this is ever found, checking izip()'s arguments for identity
is only a minor complication.

But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():
>>> from itertools import *
>>> nv = "foo 1 bar 2 baz 3\n".split()
>>> dict(izip(islice(nv, 0, None, 2), islice(nv, 1, None, 2)))
{'baz': '3', 'foo': '1', 'bar': '2'}


However, this is less readable (probably slower too) than the original with
normal slices and therefore not worth the effort for small lists like (I
guess) those in the OP's problem.

Peter

Jul 18 '05 #6
Peter Otten wrote:
But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():


No! Yours is much more elegant! Wonderful... zero overhead.

-Peter
Jul 18 '05 #7
Peter Otten <__*******@web.de> wrote in
news:c9*************@news.t-online.com:
However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this
way, but is the order of consuming the inputs actually guaranteed
anywhere?


I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be
sure whether the previous consumer just read 10 items ahead for
efficiency reasons. Allowing such optimizations would in effect limit
iterators to for loops. Moreover, the calling function has no way of
knowing whether that would really be efficient as the first iterator
might take a looong time to yield the next value while the second
could just throw a StopIteration. If a way around this is ever found,
checking izip()'s arguments for identity is only a minor complication.


What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.

Jul 18 '05 #8
Duncan Booth wrote:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.


Or as an interim measure: write lots of elegant code using that
technique, and then if anyone suggests changing the way it
works the rest of the world will shout "no, it will break code!".
;-)

-Peter
Jul 18 '05 #9
Duncan Booth wrote:
What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.
This would also affect the calling code, when the arguments are iterators
(just swapping arguments to simulate the effect of the proposed
optimization):
>>> ia, ib = iter(range(3)), iter([])
>>> zip(ia, ib)
[]
>>> ia.next()
1
>>> ia, ib = iter(range(3)), iter([])
>>> zip(ib, ia)
[]
>>> ia.next()
0


Optimizations that are visible from the calling code always seem a bad idea
and against Python's philosophy. I admit the above reuse pattern is not
very likely, though.
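
The same experiment can be sketched for Python 3, where `zip` is lazy (so it is wrapped in `list()` here) and `ia.next()` is spelled `next(ia)`. The variable names holding the results are assumptions added for the demonstration.

```python
# Left iterator has items, right iterator is empty.
ia, ib = iter(range(3)), iter([])
first = list(zip(ia, ib))    # zip takes one item from ia, then finds ib empty
after_left = next(ia)        # so ia has already lost its first item

# Same iterators, arguments swapped.
ia, ib = iter(range(3)), iter([])
second = list(zip(ib, ia))   # ib is empty, so ia is never touched
after_right = next(ia)

print(first, after_left, second, after_right)
```

Both calls produce an empty result, yet the caller can tell them apart by how much of `ia` was consumed.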
Passing the same iterator multiple times to izip is a pretty neat idea,
but I would still be happier if the documentation explicitly stated that
it consumes its arguments left to right.


From the itertools documentation:

"""
izip(*iterables)

Make an iterator that aggregates elements from each of the iterables. Like
zip() except that it returns an iterator instead of a list. Used for
lock-step iteration over several iterables at a time. Equivalent to:

def izip(*iterables):
    iterables = map(iter, iterables)
    while iterables:
        result = [i.next() for i in iterables]
        yield tuple(result)
"""

I'd say the "Equivalent to [reference implementation]" statement should meet
your request.
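
A rough Python 3 rendering of that reference implementation, as a sketch only: the name `izip_sketch` is hypothetical, `i.next()` becomes `next(i)`, and `StopIteration` is caught explicitly because letting it escape from a generator body is an error in current Python versions. The inner loop visits the iterators strictly left to right, which is what makes the shared-iterator pairing trick work. The Python 3 documentation for the builtin `zip` also states that left-to-right evaluation of the iterables is guaranteed.

```python
def izip_sketch(*iterables):
    """Lock-step iteration, modeled on the quoted reference code."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        result = []
        for it in iterators:          # strictly left to right
            try:
                result.append(next(it))
            except StopIteration:
                return                # shortest input ends the iteration
        yield tuple(result)


nv = iter("foo 1 bar 2 baz 3\n".split())
pairs = dict(izip_sketch(nv, nv))     # the same iterator passed twice
print(pairs)
```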

Peter
Jul 18 '05 #10
Peter Hansen wrote:
Peter Otten wrote:
But if that lets you sleep better at night, change Peter Hansen's
suggestion to use islice():


No! Yours is much more elegant! Wonderful... zero overhead.


I should mention I've picked up the trick on c.l.py (don't remember the
poster).

Peter
Jul 18 '05 #11
me****@frogwing.com (Michael Ellis) writes:
for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])


Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])
Jul 18 '05 #12
Paul Rubin wrote:
me****@frogwing.com (Michael Ellis) writes:
for line in infile:
    tokens = line.split()
    dict = {}
    for i in range(0, len(tokens), 2): dict[tokens[i]] = tokens[i+1]
    do_something_with_values(dict['foo'], dict['bar'])

Here's a pessimized version:

for line in infile:
    tokens = line.split()
    d = {}
    while tokens:
        name = tokens.pop(0)
        value = tokens.pop(0)
        d[name] = value
    do_something_with_values(d['foo'], d['bar'])


Paul, I don't understand why you say "pessimized".
The only potential flaw in the original and most
(all) of the other solutions seems to be present
in yours as well: if there are an odd number of
tokens on a line an exception will be raised.
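
A small sketch of the odd-token case, assuming Python 3. The index-based loop does raise `IndexError` on a dangling name, whereas the slice-and-`zip` variants drop the unpaired token without complaint. The helper `pairs_strict` is a hypothetical way to make the failure explicit.

```python
def pairs_strict(line):
    """Parse name/value pairs, rejecting a dangling name (hypothetical helper)."""
    tokens = line.split()
    if len(tokens) % 2:
        raise ValueError("odd number of tokens: %r" % line)
    return dict(zip(tokens[::2], tokens[1::2]))


odd_line = "foo 1 bar"
tokens = odd_line.split()

# Index-based loop from the original post: fails on the missing value.
try:
    d = {}
    for i in range(0, len(tokens), 2):
        d[tokens[i]] = tokens[i + 1]
    index_result = "ok"
except IndexError:
    index_result = "IndexError"

# Slice-and-zip: the dangling name is dropped without any error.
zip_result = dict(zip(tokens[::2], tokens[1::2]))

print(index_result, zip_result)
```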

-Peter
Jul 18 '05 #13
Peter Otten <__*******@web.de> wrote in message news:<c9*************@news.t-online.com>...
Yet another way to create the dictionary:
>>> import itertools
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(itertools.izip(nv, nv))
{'baz': '3', 'foo': '1', 'bar': '2'}


Peter


Cool! This should go in the Cookbook, in the shortcuts section.

Michele Simionato
Jul 18 '05 #14
Peter Otten <__*******@web.de> wrote in message news:<c9*************@news.t-online.com>...
Yet another way to create the dictionary:
>>> import itertools
>>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>> dict(itertools.izip(nv, nv))
{'baz': '3', 'foo': '1', 'bar': '2'}
Peter


BTW, the name I have seen for this kind of thing is chop:
>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )
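
For reference, a sketch of the same `chop` on Python 3, where the builtin `zip` is already lazy and takes the place of `izip`. The variable names in the usage lines are assumptions. Note that a trailing group shorter than `n` is silently discarded.

```python
def chop(iterable, n):
    """Group items into n-tuples; a trailing partial group is dropped."""
    it = iter(iterable)
    # The same iterator object appears n times in the argument list,
    # so each output tuple takes n consecutive items from it.
    return zip(*[it] * n)


triples = list(chop([1, 2, 3, 4, 5, 6], 3))
pairs_dropping_tail = list(chop(range(7), 2))
print(triples, pairs_dropping_tail)
```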

Michele Simionato
Jul 18 '05 #15
has
me****@frogwing.com (Michael Ellis) wrote in message news:<f2**************************@posting.google.com>...
Hi,
I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?


If it's terseness you want:

import re

for line in infile:
    d = dict(re.findall('([^ ]+) ([^ ]+)', line))
    do_something_with_values(d['foo'], d['bar'])

Hardly worth worrying about though...
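
One detail worth checking with the regex approach, sketched here for Python 3: the class `[^ ]` also matches a newline, so the last value on a line read from a file keeps its trailing `"\n"`. The `\S+`/`\s+` variant below is an assumption about the intended behaviour, not part of the original post. It also tolerates tabs and repeated spaces.

```python
import re

line = "foo 1 bar 2 baz 3\n"

# Pattern as posted: "[^ ]" excludes only the space character.
as_posted = dict(re.findall('([^ ]+) ([^ ]+)', line))

# Variant: \S excludes all whitespace, \s+ allows any run of it.
variant = dict(re.findall(r'(\S+)\s+(\S+)', line))

print(repr(as_posted["baz"]), repr(variant["baz"]))
```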
Jul 18 '05 #16
Michele Simionato wrote:
>>> import itertools
>>> def chop(it, n):
...     tup = (iter(it),)*n
...     return itertools.izip(*tup)
...
>>> list(chop([1,2,3,4,5,6],3))
[(1, 2, 3), (4, 5, 6)]
>>> list(chop([1,2,3,4,5,6],2))
[(1, 2), (3, 4), (5, 6)]
>>> list(chop([1,2,3,4,5,6],1))
[(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )


I don't think so. IMO the itertools examples page would be a better place
for the above than the cookbook. If an example had to be removed in
exchange, that would be iteritems(). Raymond, are you looking?

Peter

Jul 18 '05 #17
