Cleaner idiom for text processing?
https://bytes.com/topic/python/answers/31667-cleaner-idiom-text-processing

Michael Ellis wrote:

I have some data files with lines in space-delimited <name> <value>
format. There are multiple name-value pairs per line.

Is there a cleaner idiom than the following for reading each line into
an associative array for the purpose of accessing values by name?

    for line in infile:
        tokens = line.split()
        dict = {}
        for i in range(0, len(tokens), 2):
            dict[tokens[i]] = tokens[i+1]
        do_something_with_values(dict['foo'], dict['bar'])

    for line in infile:
        tokens = line.split()
        d = dict(zip(tokens[::2], tokens[1::2]))
        do_something_with_values(...)

By the way, don't use "dict" as a variable name. It's already
a builtin factory function to create dictionaries.

-Peter

Jul 18 '05 #2
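
A minimal sketch of the slicing idiom from the post above, written for Python 3 (an assumption; the thread predates it). The helper name pairs_to_dict is hypothetical. Note that zip() stops at the shorter input, so a trailing name without a value is silently dropped here rather than raising.

```python
def pairs_to_dict(line):
    """Build a dict from a line of alternating name/value tokens."""
    tokens = line.split()
    # Even-indexed tokens are names, odd-indexed tokens are values.
    return dict(zip(tokens[::2], tokens[1::2]))


record = pairs_to_dict("foo 1 bar 2 baz 3\n")
print(record["foo"], record["bar"])

# A dangling name has no partner, so zip() leaves it out.
truncated = pairs_to_dict("foo 1 bar")
print(truncated)
```

Naming the result record (or d) instead of dict keeps the builtin available for the dict(...) call itself.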

Jeff Epler

I'd move the logic that turns the file into the form you want to
process into its own generator function, under the assumption that
you'll use this code from multiple places.

    def process_tokens(f):
        for line in f:
            tokens = line.split()
            d = {}
            for i in range(0, len(tokens), 2):
                d[tokens[i]] = tokens[i+1]
            yield d

Then,

    for d in process_tokens(infile):
        do_something_with_values(d['foo'], d['bar'])

If the specific keys you want from each line are constant for the loop,
have process_tokens yield those items in sequence:

    def process_tokens2(f, keys):
        for line in f:
            tokens = line.split()
            d = {}
            for i in range(0, len(tokens), 2):
                d[tokens[i]] = tokens[i+1]
            yield [d[k] for k in keys]

    for foo, bar in process_tokens2(infile, ["foo", "bar"]):
        do_something_with_values(foo, bar)

Jeff

Jul 18 '05 #3
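
The generator approach above can be tried without a real file. This is a sketch under assumptions: Python 3, io.StringIO standing in for infile, and hypothetical names iter_records and iter_values in place of process_tokens and process_tokens2.

```python
import io


def iter_records(f):
    """Yield one dict per line of alternating name/value tokens read from f."""
    for line in f:
        tokens = line.split()
        yield dict(zip(tokens[::2], tokens[1::2]))


def iter_values(f, keys):
    """Yield, for each line of f, the values for the given keys in order."""
    for record in iter_records(f):
        yield [record[k] for k in keys]


# Stand-in for the data file; key order may differ from line to line.
infile = io.StringIO("foo 1 bar 2\nbar 4 foo 3\n")
rows = list(iter_values(infile, ["foo", "bar"]))
print(rows)
```

Because the keys are looked up by name, the caller gets the values in the order it asked for regardless of their order on the line.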

Michael Ellis wrote:

> I have some data files with lines in space-delimited <name> <value>
> format. There are multiple name-value pairs per line.
>
> Is there a cleaner idiom than the following for reading each line into
> an associative array for the purpose of accessing values by name?
>
>     for line in infile:
>         tokens = line.split()
>         dict = {}
>         for i in range(0, len(tokens), 2):
>             dict[tokens[i]] = tokens[i+1]
>         do_something_with_values(dict['foo'], dict['bar'])

Yet another way to create the dictionary:

    >>> import itertools
    >>> nv = iter("foo 1 bar 2 baz 3\n".split())
    >>> dict(itertools.izip(nv, nv))
    {'baz': '3', 'foo': '1', 'bar': '2'}

Peter

Jul 18 '05 #4

Duncan Booth

Peter Otten <__*******@web.de> wrote in news:c9*************@news.t-online.com:

> Yet another way to create the dictionary:
>
>     >>> import itertools
>     >>> nv = iter("foo 1 bar 2 baz 3\n".split())
>     >>> dict(itertools.izip(nv, nv))
>     {'baz': '3', 'foo': '1', 'bar': '2'}

You can also do that without using itertools:

    >>> nv = iter("foo 1 bar 2 baz 3\n".split())
    >>> dict(zip(nv,nv))
    {'baz': '3', 'foo': '1', 'bar': '2'}

However, I'm not sure I trust either of these solutions. I know that
intuitively it would seem that both zip and izip should act in this way,
but is the order of consuming the inputs actually guaranteed anywhere?

Jul 18 '05 #5

Duncan Booth wrote:

> Peter Otten <__*******@web.de> wrote in news:c9*************@news.t-online.com:
>
>> Yet another way to create the dictionary:
>>
>>     >>> import itertools
>>     >>> nv = iter("foo 1 bar 2 baz 3\n".split())
>>     >>> dict(itertools.izip(nv, nv))
>>     {'baz': '3', 'foo': '1', 'bar': '2'}
>
> You can also do that without using itertools:
>
>     >>> nv = iter("foo 1 bar 2 baz 3\n".split())
>     >>> dict(zip(nv,nv))
>     {'baz': '3', 'foo': '1', 'bar': '2'}

The advantage of my solution is that it omits the intermediate list.

> However, I'm not sure I trust either of these solutions. I know that
> intuitively it would seem that both zip and izip should act in this way,
> but is the order of consuming the inputs actually guaranteed anywhere?

I think an optimization that changes the order assumed above would be
*really* weird. When passing around an iterator, you could never be sure
whether the previous consumer just read 10 items ahead for efficiency
reasons. Allowing such optimizations would in effect limit iterators to for
loops. Moreover, the calling function has no way of knowing whether that
would really be efficient as the first iterator might take a looong time to
yield the next value while the second could just throw a StopIteration. If
a way around this is ever found, checking izip()'s arguments for identity
is only a minor complication.

But if that lets you sleep better at night, change Peter Hansen's suggestion
to use islice():

    >>> from itertools import *
    >>> nv = "foo 1 bar 2 baz 3\n".split()
    >>> dict(izip(islice(nv, 0, None, 2), islice(nv, 1, None, 2)))
    {'baz': '3', 'foo': '1', 'bar': '2'}

However, this is less readable (probably slower too) than the original with
normal slices and therefore not worth the effort for small lists like (I
guess) those in the OP's problem.

Peter

Jul 18 '05 #6
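
For comparison, here is a sketch of both variants discussed above in Python 3, where the builtin zip() is already lazy and itertools.izip no longer exists. The function names are hypothetical. The islice version takes two independent passes over the token list, so it does not depend on how zip() interleaves reads from a shared iterator.

```python
from itertools import islice


def pairs_shared_iterator(tokens):
    """Pair consecutive items by handing the same iterator to zip() twice."""
    it = iter(tokens)
    return dict(zip(it, it))


def pairs_islice(tokens):
    """Pair items using two independent strided views of a list."""
    names = islice(tokens, 0, None, 2)   # items 0, 2, 4, ...
    values = islice(tokens, 1, None, 2)  # items 1, 3, 5, ...
    return dict(zip(names, values))


tokens = "foo 1 bar 2 baz 3\n".split()
print(pairs_shared_iterator(tokens))
print(pairs_islice(tokens))
```

The islice variant needs a re-iterable sequence such as a list; passing it a one-shot iterator would make the two views compete for the same items.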

Peter Otten wrote:

> But if that lets you sleep better at night, change Peter Hansen's suggestion
> to use islice():

No! Yours is much more elegant! Wonderful... zero overhead.

-Peter

Jul 18 '05 #7

Duncan Booth

Peter Otten <__*******@web.de> wrote in
news:c9*************@news.t-online.com:

>> However, I'm not sure I trust either of these solutions. I know that
>> intuitively it would seem that both zip and izip should act in this
>> way, but is the order of consuming the inputs actually guaranteed
>> anywhere?
>
> I think an optimization that changes the order assumed above would be
> *really* weird. When passing around an iterator, you could never be
> sure whether the previous consumer just read 10 items ahead for
> efficiency reasons. Allowing such optimizations would in effect limit
> iterators to for loops. Moreover, the calling function has no way of
> knowing whether that would really be efficient as the first iterator
> might take a looong time to yield the next value while the second
> could just throw a StopIteration. If a way around this is ever found,
> checking izip()'s arguments for identity is only a minor complication.

What happens if someone works out that izip can be made much faster by
consuming its iterators from right to left instead of left to right? That
isn't nearly as far fetched as reading ahead.

Passing the same iterator multiple times to izip is a pretty neat idea, but
I would still be happier if the documentation explicitly stated that it
consumes its arguments left to right.

Jul 18 '05 #8

Duncan Booth wrote:

> What happens if someone works out that izip can be made much faster by
> consuming its iterators from right to left instead of left to right? That
> isn't nearly as far fetched as reading ahead.
>
> Passing the same iterator multiple times to izip is a pretty neat idea, but
> I would still be happier if the documentation explicitly stated that it
> consumes its arguments left to right.

Or as an interim measure: write lots of elegant code using that
technique, and then if anyone suggests changing the way it
works the rest of the world will shout "no, it will break code!".
;-)

-Peter

Jul 18 '05 #9

Duncan Booth wrote:

> What happens if someone works out that izip can be made much faster by
> consuming its iterators from right to left instead of left to right? That
> isn't nearly as far fetched as reading ahead.

This would also affect the calling code, when the arguments are iterators
(just swapping arguments to simulate the effect of the proposed
optimization):

    >>> ia, ib = iter(range(3)), iter([])
    >>> zip(ia, ib)
    []
    >>> ia.next()
    1
    >>> ia, ib = iter(range(3)), iter([])
    >>> zip(ib, ia)
    []
    >>> ia.next()
    0

Optimizations that are visible from the calling code always seem a bad idea
and against Python's philosophy. I admit the above reuse pattern is not
very likely, though.

> Passing the same iterator multiple times to izip is a pretty neat idea,
> but I would still be happier if the documentation explicitly stated that
> it consumes its arguments left to right.

From the itertools documentation:

"""
izip(*iterables)

Make an iterator that aggregates elements from each of the iterables. Like
zip() except that it returns an iterator instead of a list. Used for
lock-step iteration over several iterables at a time. Equivalent to:

    def izip(*iterables):
        iterables = map(iter, iterables)
        while iterables:
            result = [i.next() for i in iterables]
            yield tuple(result)
"""

I'd say the "Equivalent to [reference implementation]" statement should meet
your request.

Peter

Jul 18 '05 #10
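
The argument-swapping experiment above can be repeated in Python 3 (an assumption; there the builtin zip() is lazy and .next() becomes next()). Later versions of the zip() documentation state that the left-to-right evaluation order of the iterables is guaranteed, which is the property the shared-iterator idiom depends on. The variable names below are only illustrative.

```python
# The left argument is advanced once before zip() finds the right one empty.
ia, ib = iter(range(3)), iter([])
pairs_left_first = list(zip(ia, ib))
next_after_left_first = next(ia)

# With the empty iterator on the left, the other one is never touched.
ia, ib = iter(range(3)), iter([])
pairs_empty_first = list(zip(ib, ia))
next_after_empty_first = next(ia)

# The same ordering is what makes zip(nv, nv) yield (name, value) pairs.
nv = iter("foo 1 bar 2 baz 3\n".split())
record = dict(zip(nv, nv))

print(next_after_left_first, next_after_empty_first, record)
```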

Peter Hansen wrote:

> Peter Otten wrote:
>> But if that lets you sleep better at night, change Peter Hansen's
>> suggestion to use islice():
>
> No! Yours is much more elegant! Wonderful... zero overhead.

I should mention I've picked up the trick on c.l.py (don't remember the
poster).

Peter

Jul 18 '05 #11

Paul Rubin

me****@frogwing.com (Michael Ellis) writes:

>     for line in infile:
>         tokens = line.split()
>         dict = {}
>         for i in range(0, len(tokens), 2):
>             dict[tokens[i]] = tokens[i+1]
>         do_something_with_values(dict['foo'], dict['bar'])

Here's a pessimized version:

    for line in infile:
        tokens = line.split()
        d = {}
        while tokens:
            name = tokens.pop(0)
            value = tokens.pop(0)
            d[name] = value
        do_something_with_values(d['foo'], d['bar'])

Jul 18 '05 #12

Paul Rubin wrote:

> me****@frogwing.com (Michael Ellis) writes:
>
>>     for line in infile:
>>         tokens = line.split()
>>         dict = {}
>>         for i in range(0, len(tokens), 2):
>>             dict[tokens[i]] = tokens[i+1]
>>         do_something_with_values(dict['foo'], dict['bar'])
>
> Here's a pessimized version:
>
>     for line in infile:
>         tokens = line.split()
>         d = {}
>         while tokens:
>             name = tokens.pop(0)
>             value = tokens.pop(0)
>             d[name] = value
>         do_something_with_values(d['foo'], d['bar'])

Paul, I don't understand why you say "pessimized".
The only potential flaw in the original and most
(all) of the other solutions seems to be present
in yours as well: if there are an odd number of
tokens on a line an exception will be raised.

-Peter

Jul 18 '05 #13
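
On the odd-token case raised above: the index-based loop and the pop(0) loop both fail with an IndexError on a dangling name, while the zip()-based versions quietly drop it. A sketch of one way to make the failure explicit follows, assuming Python 3; the name parse_pairs and the choice of ValueError are illustrative, not something the thread settled on.

```python
def parse_pairs(line):
    """Return a dict of name/value pairs, rejecting lines with a dangling name."""
    tokens = line.split()
    if len(tokens) % 2:
        # An odd count means the last name has no value to go with it.
        raise ValueError("unpaired name %r in line %r" % (tokens[-1], line))
    it = iter(tokens)
    return dict(zip(it, it))


good = parse_pairs("foo 1 bar 2\n")

try:
    parse_pairs("foo 1 bar\n")
except ValueError as exc:
    message = str(exc)
    print(message)
```

As for "pessimized": each tokens.pop(0) shifts the remaining list items down by one, so that loop does more work per line than the slicing or iterator versions, although for short lines the difference is unlikely to matter.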

Michele Simionato

Peter Otten <__*******@web.de> wrote in message news:<c9*************@news.t-online.com>...

> Yet another way to create the dictionary:
>
>     >>> import itertools
>     >>> nv = iter("foo 1 bar 2 baz 3\n".split())
>     >>> dict(itertools.izip(nv, nv))
>     {'baz': '3', 'foo': '1', 'bar': '2'}
>
> Peter

Cool! This should go in the Cookbook, in the shortcuts section.

Michele Simionato

Jul 18 '05 #14

Michele Simionato

Peter Otten <__*******@web.de> wrote in message news:<c9*************@news.t-online.com>...

> Yet another way to create the dictionary:
>
>     >>> import itertools
>     >>> nv = iter("foo 1 bar 2 baz 3\n".split())
>     >>> dict(itertools.izip(nv, nv))
>     {'baz': '3', 'foo': '1', 'bar': '2'}
>
> Peter

BTW, the name I have seen for this kind of thing is chop:

    >>> import itertools
    >>> def chop(it, n):
    ...     tup = (iter(it),)*n
    ...     return itertools.izip(*tup)
    ...
    >>> list(chop([1,2,3,4,5,6],3))
    [(1, 2, 3), (4, 5, 6)]
    >>> list(chop([1,2,3,4,5,6],2))
    [(1, 2), (3, 4), (5, 6)]
    >>> list(chop([1,2,3,4,5,6],1))
    [(1,), (2,), (3,), (4,), (5,), (6,)]

(I don't remember if this is already in itertools ... )

Michele Simionato

Jul 18 '05 #15
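
A Python 3 sketch of chop, assuming the builtin zip() in place of izip. The chop_padded variant is an addition for illustration: it uses itertools.zip_longest so that leftover items are kept and padded instead of dropped. The itertools documentation carries a similar "grouper" recipe, and Python 3.12 added itertools.batched for the unpadded case.

```python
from itertools import zip_longest


def chop(iterable, n):
    """Group items into n-tuples; items that do not fill a tuple are dropped."""
    iterators = (iter(iterable),) * n  # n references to the SAME iterator
    return zip(*iterators)


def chop_padded(iterable, n, fillvalue=None):
    """Like chop, but pad the final tuple with fillvalue instead of dropping it."""
    iterators = (iter(iterable),) * n
    return zip_longest(*iterators, fillvalue=fillvalue)


print(list(chop([1, 2, 3, 4, 5, 6], 3)))
print(list(chop(range(7), 3)))
print(list(chop_padded(range(7), 3)))
```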

has

me****@frogwing.com (Michael Ellis) wrote in message news:<f2**************************@posting.google.com>...

> Hi,
> I have some data files with lines in space-delimited <name> <value>
> format. There are multiple name-value pairs per line.
>
> Is there a cleaner idiom than the following for reading each line into
> an associative array for the purpose of accessing values by name?

If it's terseness you want:

import re

    for line in infile:
        d = dict(re.findall('([^ ]+) ([^ ]+)', line))
        do_something_with_values(d['foo'], d['bar'])

Hardly worth worrying about though...

Jul 18 '05 #16
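
One caveat with the pattern above: [^ ] also matches a newline, so when the line still carries its trailing "\n" the last value picks it up. A sketch of a whitespace-aware variant, assuming Python 3; the names PAIR and pairs_from_regex are hypothetical.

```python
import re

# \S+ matches a run of non-whitespace, \s+ any run of spaces, tabs or newlines.
PAIR = re.compile(r"(\S+)\s+(\S+)")


def pairs_from_regex(line):
    """Build a dict from alternating name/value tokens using a regex."""
    return dict(PAIR.findall(line))


line = "foo 1 bar 2 baz 3\n"
original = dict(re.findall('([^ ]+) ([^ ]+)', line))
adjusted = pairs_from_regex(line)

print(repr(original["baz"]))  # the newline is still attached
print(repr(adjusted["baz"]))
```

The \s+ separator also copes with tabs or repeated spaces between tokens, which the single-space pattern does not.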


Michele Simionato wrote:

>     >>> import itertools
>     >>> def chop(it, n):
>     ...     tup = (iter(it),)*n
>     ...     return itertools.izip(*tup)
>     ...
>     >>> list(chop([1,2,3,4,5,6],3))
>     [(1, 2, 3), (4, 5, 6)]
>     >>> list(chop([1,2,3,4,5,6],2))
>     [(1, 2), (3, 4), (5, 6)]
>     >>> list(chop([1,2,3,4,5,6],1))
>     [(1,), (2,), (3,), (4,), (5,), (6,)]
>
> (I don't remember if this is already in itertools ... )

I don't think so. IMO the itertools examples page would be a better place
for the above than the cookbook. If an example had to be removed in
exchange, that would be iteritems(). Raymond, are you looking?

Peter

Jul 18 '05 #17
