Controlling a generator the pythonic way

Hi,

I'm trying to figure out what is the most pythonic way to interact with
a generator.

The task I'm trying to accomplish is writing a PDF tokenizer, and I want
to implement it as a Python generator. Suppose all the ugly details of
tokenizing PDF can be handled (such as embedded streams of arbitrary
binary content). There remains one problem, though: In order to get
random file access, the tokenizer should not simply spit out a series of
tokens read from the file sequentially; it should rather be possible to
point it at places in the file at random.

I can see two possibilities to do this: either the current file position
has to be read from somewhere (say, a mutable object passed to the
generator) after each yield, or a new generator needs to be instantiated
every time the tokenizer is pointed to a new file position.

The first approach has two disadvantages: the pointer value is exposed,
and, because of the complex rules for splitting a PDF into tokens, there
will be a lot of yield statements in the generator code, which would make
for a lot of pointer assignments. This seems ugly to me.

The second approach is cleaner in that respect, but pointing the
tokenizer to some place has now the added semantics of creating a whole
new generator instance. The programmer using the tokenizer now needs to
remember to throw away any references to the generator each time the
pointer is reset, which is also ugly.
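
For concreteness, the two options might look roughly like the sketch below. It is a toy, not a PDF tokenizer: tokens are just whitespace-separated words in a string, and the names Cursor, tokens_shared_cursor and tokens_from are made up for illustration.

```python
import re

WORD = re.compile(r"\S+")


class Cursor:
    """Hypothetical mutable position holder shared with the generator."""

    def __init__(self, pos=0):
        self.pos = pos


def tokens_shared_cursor(data, cursor):
    # Approach 1: re-read the shared position after every yield,
    # because the caller may have moved it in the meantime.
    while True:
        match = WORD.search(data, cursor.pos)
        if match is None:
            return
        cursor.pos = match.end()
        yield match.group()


def tokens_from(data, pos=0):
    # Approach 2: the position is fixed at creation time; seeking
    # means creating a new generator.
    for match in WORD.finditer(data, pos):
        yield match.group()


data = "alpha beta gamma delta"

cursor = Cursor()
gen = tokens_shared_cursor(data, cursor)
first = next(gen)    # reads from position 0
cursor.pos = 11      # caller repositions the shared cursor
second = next(gen)   # reads from position 11

fresh = tokens_from(data, 11)  # new generator for the new position
third = next(fresh)
```

Both variants return the same token after the jump; the difference is only in who owns the position and whether the old generator object stays valid.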

Does anybody here have a third way of dealing with this? Otherwise,
which ugliness is the more pythonic one?

Thanks a lot for any ideas.

--
Thomas
Jul 19 '05 #1
12 Replies


Thomas Lotze wrote:
I can see two possibilities to do this: either the current file position
has to be read from somewhere (say, a mutable object passed to the
generator) after each yield, or a new generator needs to be instantiated
every time the tokenizer is pointed to a new file position.
...
Does anybody here have a third way of dealing with this? Otherwise,
which ugliness is the more pythonic one?


The third approach, which is certain to be cleanest for this situation,
is to have a custom class which stores the state information you need,
and have the generator simply be a method in that class. There's no
reason that a generator has to be a standalone function.

class PdfTokenizer:
    def __init__(self, ...):
        # set up initial state

    def getTokens(self):
        while whatever:
            yield token

    def seek(self, newPosition):
        # change state here

# usage:
pdf = PdfTokenizer('myfile.pdf', ...)
for token in pdf.getTokens():
    # do stuff...

    if I need to change position:
        pdf.seek(...)

Easy as pie! :-)
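
A runnable version of that outline could look like the following. This is only a sketch under simplifying assumptions: WordTokenizer, get_tokens and the whitespace-splitting rule are stand-ins invented here, and real PDF tokenizing is far more involved.

```python
import re


class WordTokenizer:
    """Toy stand-in for the PdfTokenizer outlined above."""

    _word = re.compile(r"\S+")

    def __init__(self, data):
        self.data = data
        self.pos = 0  # all tokenizer state lives on the instance

    def get_tokens(self):
        while True:
            # The position is re-read from self on every step, so a
            # seek() between two tokens takes effect immediately.
            match = self._word.search(self.data, self.pos)
            if match is None:
                return
            self.pos = match.end()
            yield match.group()

    def seek(self, new_position):
        self.pos = new_position


tok = WordTokenizer("one two three four")
seen = []
for token in tok.get_tokens():
    seen.append(token)
    if token == "three" and seen.count("three") == 1:
        tok.seek(4)  # jump back to the start of "two", once
```

After the loop, seen holds one, two, three, two, three, four. Seeking works here only because the generator keeps no state of its own between yields.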

-Peter
Jul 19 '05 #2

Peter Hansen wrote:
Thomas Lotze wrote:
I can see two possibilities to do this: either the current file position
has to be read from somewhere (say, a mutable object passed to the
generator) after each yield, [...]


The third approach, which is certain to be cleanest for this situation, is
to have a custom class which stores the state information you need, and
have the generator simply be a method in that class.


Which is, as far as the generator code is concerned, basically the same as
passing a mutable object to a (possibly standalone) generator. The object
will likely be called self, and the value is stored in an attribute of it.

Probably this is indeed the best way as it doesn't require the programmer
to remember any side-effects.

It does, however, require a lot of attribute access, which does cost some
cycles.
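
Whether that cost matters can be measured rather than guessed. The sketch below uses the standard timeit module to compare updating an attribute on every step against working on a local copy; Holder, via_attribute and via_local are hypothetical names, and the numbers will differ between machines and Python versions.

```python
import timeit


class Holder:
    def __init__(self):
        self.pos = 0


def via_attribute(holder, n=100_000):
    for _ in range(n):
        holder.pos += 1          # attribute load and store on every step
    return holder.pos


def via_local(holder, n=100_000):
    pos = holder.pos             # copy to a local once
    for _ in range(n):
        pos += 1
    holder.pos = pos             # write back once
    return pos


t_attr = timeit.timeit(lambda: via_attribute(Holder()), number=5)
t_local = timeit.timeit(lambda: via_local(Holder()), number=5)
print(f"attribute: {t_attr:.4f}s  local: {t_local:.4f}s")
```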

A related problem is skipping whitespace. Sometimes you don't care about
whitespace tokens, sometimes you do. Using generators, you can either set
a state variable, say on the object the generator is an attribute of,
before each call that requires a deviation from the default, or you can
have a second generator for filtering the output of the first. Again, both
solutions are ugly (the second more so than the first). One uses
side-effects instead of passing parameters, which is what one really
wants, while the other is dumb and slow (filtering can be done without
taking a second look at things).

All of this makes me wonder whether more elaborate generator semantics
(maybe even allowing for passing arguments in the next() call) would not
be useful. And yes, I have read the recent postings on PEP 343 - sigh.
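
For readers coming to this thread later: Python 2.5 added generator.send() (PEP 342), which lets the caller pass a value into a suspended generator. A minimal sketch of a per-call whitespace option built on it follows; the tokens function and its splitting rule are assumptions made for illustration.

```python
import re

TOKEN = re.compile(r"\s+|\S+")  # runs of whitespace or of non-whitespace


def tokens(data):
    skip_space = False
    for match in TOKEN.finditer(data):
        text = match.group()
        if skip_space and text.isspace():
            continue
        # send(value) resumes the generator and makes this yield
        # expression evaluate to value; plain next() makes it None.
        option = yield text
        skip_space = bool(option)


gen = tokens("a b  c")
first = next(gen)          # a generator must be started with next()
second = gen.send(True)    # skip whitespace for this call only
third = next(gen)          # back to the default: whitespace is returned
fourth = gen.send(False)
```

Here the option applies to exactly one call, because it is re-read from each yield expression.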

--
Thomas
Jul 19 '05 #3

Thomas Lotze wrote:
Which is, as far as the generator code is concerned, basically the same as
passing a mutable object to a (possibly standalone) generator. The object
will likely be called self, and the value is stored in an attribute of it.

Fair enough, but who cares what the generator code thinks? It's what
the programmer has to deal with that matters, and an object is going to
have a cleaner interface than a generator-plus-mutable-object.

Probably this is indeed the best way as it doesn't require the programmer
to remember any side-effects.

It does, however, require a lot of attribute access, which does cost some
cycles.


Hmm... "premature optimization" is all I have to say about that.

-Peter
Jul 19 '05 #4

Thomas Lotze <th****@thomas-lotze.de> writes:
A related problem is skipping whitespace. Sometimes you don't care about
whitespace tokens, sometimes you do. Using generators, you can either set
a state variable, say on the object the generator is an attribute of,
before each call that requires a deviation from the default, or you can
have a second generator for filtering the output of the first. Again, both
solutions are ugly (the second more so than the first). One uses
side-effects instead of passing parameters, which is what one really
wants, while the other is dumb and slow (filtering can be done without
taking a second look at things).


I wouldn't call the first method ugly; I'd say it's *very* OO.

Think of an object instance as a machine. It has various knobs,
switches and dials you can use to control its behavior, and displays
you can use to read data from it, or parts of its state. A switch
labelled "ignore whitespace" is a perfectly reasonable thing for a
tokenizing machine to have.

Yes, such a switch gets the desired behavior as a side effect. Then
again, a generator that returns tokens has a desired behavior
(advancing to the next token) as a side effect(*). If you think about
these things as the state of the object, rather than "side effects",
it won't seem nearly as ugly. In fact, part of the point of using a
class is to encapsulate the state required for some activity in one
place.

Wanting to do everything via parameters to methods is a very top-down
way of looking at the problem. It's not necessarily correct in an OO
environment.

<mike

*) It's noticeable that some OO languages/libraries avoid this side
effect: the read method updates an attribute, so you do the read then
get the object read from the attribute. That's very OO, but not very
pythonic.
--
Mike Meyer <mw*@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
Jul 19 '05 #5

Peter Hansen wrote:
Fair enough, but who cares what the generator code thinks? It's what the
programmer has to deal with that matters, and an object is going to have a
cleaner interface than a generator-plus-mutable-object.


That's right, and among the choices discussed, the object is the one I do
prefer. I just don't feel really satisfied...

It does, however, require a lot of attribute access, which does cost
some cycles.


Hmm... "premature optimization" is all I have to say about that.


But when is the right time to optimize? There's a point when the thing
runs, does the right thing and - by the token of "make it run, make it
right, make it fast" - might get optimized. And if there are places in a
PDF library that might justly be optimized, the tokenizer is certainly one
of them as it gets called really often.

Still, I'm going to focus on cleaner code and, first and foremost, a clean
API if it comes to a decision between these goals and optimization - at
least as long as I'm talking about pure Python code.

--
Thomas
Jul 19 '05 #6

Mike Meyer wrote:
Yes, such a switch gets the desired behavior as a side effect. Then again,
a generator that returns tokens has a desired behavior (advancing to the
next token) as a side effect(*).

That's certainly true.

If you think about these things as the
state of the object, rather than "side effects", it won't seem nearly as
ugly. In fact, part of the point of using a class is to encapsulate the
state required for some activity in one place.

Wanting to do everything via parameters to methods is a very top-down way
of looking at the problem. It's not necessarily correct in an OO
environment.

What worries me about the approach of changing state before making a
next() call instead of doing it at the same time by passing a parameter is
that the state change is meant to affect only a single call. The picture
might fit better (IMO) if it didn't look so much like working around the
fact that the next() call can't take parameters for some technical reason.

I agree that decoupling state changes and next() calls would be perfectly
beautiful if they were decoupled in the problem one wants to model. They
aren't.

*) It's noticeable that some OO languages/libraries avoid this side
effect: the read method updates an attribute, so you do the read then
get the object read from the attribute. That's very OO, but not very
pythonic.


Just out of curiosity: What makes you state that that behaviour isn't
pythonic? Is it because Python happens to do it differently, because of a
gut feeling, or because of some design principle behind Python I fail to
see right now?

--
Thomas
Jul 19 '05 #7

Thomas Lotze wrote:
Does anybody here have a third way of dealing with this?


Sleeping on it for a night is sometimes an insightful exercise *g*

I realized that there is a reason why fiddling with the pointer from
outside the generator defeats much of the purpose of using one. The
implementation using a simple method call instead of a generator needs
to store some internal state variables on an object to save them for the
next call, among them the pointer and a tokenization mode.

I could make the thing a generator by turning the single return
statement into a yield statement and adding a loop, leaving all the
importing and exporting of the pointer intact - after all, someone might
reset the pointer between next() calls.

This is, however, hardly using all the possibilities a generator allows.
I'd rather like to get rid of the mode switches by doing special things
where I detect the need for them, yielding the result, and proceeding as
before. But as soon as I move information from explicit (state variables
that can be reset along with the pointer) to implicit (the point where
the generator is suspended after yielding a token), resetting the
pointer will lead to inconsistencies.

So, it seems to me that if I do want to use generators for any practical
reason instead of just because generators are way cool, they need to be
instantiated anew each time the pointer is reset, for simple consistency
reasons.

Now a very simple idea struck me: If one is worried about throwing away
a generator as a side-effect of resetting the tokenization pointer, why
not define the whole tokenizer as not being resettable? Then the thing
needs to be re-instantiated very explicitly every time it is pointed
somewhere. While still feeling slightly awkward, it has lost the threat
of doing unexpected things.

Does this sound reasonable?
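
A sketch of what such a non-resettable tokenizer could look like is below. It is a toy under stated assumptions: the tokenize function and its rules (whitespace-separated words plus un-nested parenthesised strings) are invented here and fall far short of real PDF syntax. An unterminated string would raise ValueError from str.index.

```python
def tokenize(data, start=0):
    """Yield tokens from data[start:]; a new position needs a new call."""
    pos = start
    while pos < len(data):
        ch = data[pos]
        if ch.isspace():
            pos += 1
        elif ch == "(":
            # "String mode" exists only as this branch of the code;
            # there is no mode variable to reset from outside.
            end = data.index(")", pos)
            yield data[pos:end + 1]
            pos = end + 1
        else:
            end = pos
            while end < len(data) and not data[end].isspace() and data[end] != "(":
                end += 1
            yield data[pos:end]
            pos = end


data = "12 0 obj (hello world) endobj"
all_tokens = list(tokenize(data))
tail = list(tokenize(data, 9))  # explicit re-instantiation at offset 9
```

Because the position is a plain local variable, nothing outside the generator can move it, and pointing the tokenizer elsewhere is visibly a new call.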

--
Thomas
Jul 19 '05 #8

Thomas Lotze wrote:
A related problem is skipping whitespace. Sometimes you don't care about
whitespace tokens, sometimes you do. Using generators, you can either set
a state variable, say on the object the generator is an attribute of,
before each call that requires a deviation from the default, or you can
have a second generator for filtering the output of the first.


Last night's sleep was really productive - I've also found another way
to tackle this problem, and it's really simple IMO. One could pass the
parameter at generator instantiation time and simply create two
generators behaving differently. They work on the same data and use the
same source code, only with a different parametrization.

All one has to care about is that they never get out of sync. If the
data pointer is an object attribute, it's clear how to do it. Otherwise,
both could acquire their data from a common generator that yields the
PDF content (or a buffer representing part of it) character by
character. This is even faster than keeping a pointer and using it as an
index on the data.
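
The idea of two differently parametrized generators fed from one character source can be sketched as follows. The tokens function and its word-splitting rule are assumptions for illustration, and the sketch makes no claim about speed.

```python
def tokens(chars, keep_space):
    """Group characters from a shared iterator into tokens."""
    word = []
    for ch in chars:
        if ch.isspace():
            if word:
                yield "".join(word)
                word = []
            if keep_space:
                yield ch
        else:
            word.append(ch)
    if word:
        yield "".join(word)


chars = iter("ab cd ef")          # common character source
with_space = tokens(chars, True)
no_space = tokens(chars, False)

first = next(with_space)   # consumes "ab" and the following space
second = next(no_space)    # continues where the other one stopped
third = next(with_space)   # the space read earlier, not something new
```

The third call shows where care is needed: each generator may still hold a character it has already taken from the shared source, so switching between them is only clean at points where neither has anything pending.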

--
Thomas
Jul 19 '05 #9

Thomas Lotze wrote:
Mike Meyer wrote:
What worries me about the approach of changing state before making a
next() call instead of doing it at the same time by passing a parameter is
that the state change is meant to affect only a single call. The picture
might fit better (IMO) if it didn't look so much like working around the
fact that the next() call can't take parameters for some technical reason.


I suggest you make the tokenizer class itself into an iterator. Then you can define additional next() methods with additional parameters. You could wrap an actual generator for the convenience of having multiple yield statements. For example (borrowing Peter's PdfTokenizer):

class PdfTokenizer:
    def __init__(self, ...):
        # set up initial state
        self._tokenizer = self._getTokens()

    def __iter__(self):
        return self

    def next(self, options=None):
        # set self state according to options, if any
        n = self._tokenizer.next()
        # restore default state
        return n

    def nextIgnoringSpace(self):
        # alternate way of specifying variations
        # ...

    def _getTokens(self):
        while whatever:
            yield token

    def seek(self, newPosition):
        # change state here
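
A runnable variant of this outline, written for Python 3, where the iterator method is spelled __next__, might look like the sketch below. The class name Tokenizer, the next_token method and the token rule are assumptions made here, not part of the outline.

```python
import re

TOKEN = re.compile(r"\s+|\S+")


class Tokenizer:
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.skip_space = False
        self._tokens = self._get_tokens()

    def __iter__(self):
        return self

    def __next__(self):
        # Called by for loops and next(); uses the default options.
        return self.next_token()

    def next_token(self, skip_space=False):
        self.skip_space = skip_space      # set state for this call only
        try:
            return next(self._tokens)
        finally:
            self.skip_space = False       # restore the default

    def _get_tokens(self):
        while True:
            match = TOKEN.match(self.data, self.pos)
            if match is None:
                return
            self.pos = match.end()
            text = match.group()
            if self.skip_space and text.isspace():
                continue
            yield text

    def seek(self, new_position):
        self.pos = new_position


tok = Tokenizer("a b c")
first = next(tok)
second = tok.next_token(skip_space=True)
third = next(tok)
tok.seek(0)
fourth = next(tok)
```

One limitation of this sketch: once the wrapped generator has run off the end of the data it is finished, so a seek() after exhaustion would also need to create a fresh generator.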

Kent
Jul 19 '05 #10


"news:86************@guru.mired.org...
Thomas Lotze <th****@thomas-lotze.de> writes:
A related problem is skipping whitespace. Sometimes you don't care about
whitespace tokens, sometimes you do. Using generators, you can either set
a state variable, say on the object the generator is an attribute of,
before each call that requires a deviation from the default, or you can
have a second generator for filtering the output of the first. Again, both
solutions are ugly (the second more so than the first).


Given an application that *only* wanted non-white tokens, or tokens meeting
any other condition, filtering is, to me, exactly the right thing to do and
not ugly at all. See itertools or roll your own.

Given an application that intermittently wanted to skip over whitespace
tokens, I would use a *function*, not a second generator, that filtered the
first when, and only when, that was wanted. Given next_tok, the next
method of a token generator, this is simply

def next_nonwhite():
    ret = next_tok()
    while iswhite(ret):
        ret = next_tok()
    return ret
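
A self-contained sketch of such a helper, returning the next token that is not whitespace, is shown below. It assumes tokens are plain strings, uses str.isspace() as the whitespace test, and takes next_tok as a parameter; all three are choices made for this example.

```python
def next_nonwhite(next_tok):
    """Return the next token from next_tok() that is not whitespace."""
    tok = next_tok()
    while tok.isspace():
        tok = next_tok()
    return tok


token_iter = iter(["a", " ", "b", "\n", " ", "c"])
next_tok = token_iter.__next__   # Python 3 spelling of the bound next method

first = next_tok()
second = next_nonwhite(next_tok)   # skips " "
third = next_tok()                 # whitespace is still available on demand
fourth = next_nonwhite(next_tok)   # skips " "
```

The underlying token stream is untouched; the caller decides call by call whether whitespace is wanted.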

A generic method of sending data to a generator on the fly, without making
it an attribute of a class, is to give the generator function a mutable
parameter, a list, dict, or instance, which you mutate from outside as
desired to change the operation of the generator.

The pair of statements
<mutate generator mutable>
val = gen.next()
can, of course, be wrapped in various possible gennext(args) functions at
the cost of an additional function call.
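
As a sketch of that pattern, the generator below reads a shared options dict each time it produces a token, and a gennext wrapper changes the dict for one call only. The names tokens, gennext and the skip_space key are hypothetical.

```python
import re

TOKEN = re.compile(r"\s+|\S+")


def tokens(data, options):
    for match in TOKEN.finditer(data):
        text = match.group()
        # options is read each time, so outside changes take effect.
        if options.get("skip_space") and text.isspace():
            continue
        yield text


def gennext(gen, options, **overrides):
    """Mutate the shared options, fetch one value, then undo the change."""
    saved = dict(options)
    options.update(overrides)
    try:
        return next(gen)
    finally:
        options.clear()
        options.update(saved)


options = {"skip_space": False}
gen = tokens("x  y z", options)
first = next(gen)
second = gennext(gen, options, skip_space=True)
third = next(gen)
```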

Terry J. Reedy



Jul 19 '05 #11

Thomas Lotze wrote:
Peter Hansen wrote:

Thomas Lotze wrote:
I can see two possibilities to do this: either the current file position
has to be read from somewhere (say, a mutable object passed to the
generator) after each yield, [...]
The third approach, which is certain to be cleanest for this situation, is
to have a custom class which stores the state information you need, and
have the generator simply be a method in that class.

Which is, as far as the generator code is concerned, basically the same as
passing a mutable object to a (possibly standalone) generator. The object
will likely be called self, and the value is stored in an attribute of it.

Probably this is indeed the best way as it doesn't require the programmer
to remember any side-effects.

It does, however, require a lot of attribute access, which does cost some
cycles.

Hmm, you could probably make your program run even quicker if you took
out all the code :-)

Don't assume that there will be a perceptible impact on performance
until you have written it the easy way. I'll leave you to Google for
quotes from Donald Knuth about premature optimization.

A related problem is skipping whitespace. Sometimes you don't care about
whitespace tokens, sometimes you do. Using generators, you can either set
a state variable, say on the object the generator is an attribute of,
before each call that requires a deviation from the default, or you can
have a second generator for filtering the output of the first. Again, both
solutions are ugly (the second more so than the first). One uses
side-effects instead of passing parameters, which is what one really
wants, while the other is dumb and slow (filtering can be done without
taking a second look at things).

And, again, your obsession with performance obscures the far more
important issue: which solution is easiest to write and maintain. If the
user then turns up short of cycles they can always elect to migrate to a
faster computer: this will almost inevitably be cheaper than paying you
to speed the program up.

All of this makes me wonder whether more elaborate generator semantics
(maybe even allowing for passing arguments in the next() call) would not
be useful. And yes, I have read the recent postings on PEP 343 - sigh.

Sigh indeed. But if you allow next() calls to take arguments you are
effectively arguing for the introduction of full coroutines into the
language, and I suspect there would be pretty limited support for that.

regards
Steve
--
Steve Holden +1 703 861 4237 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/
Python Web Programming http://pydish.holdenweb.com/

Jul 19 '05 #12

Thomas Lotze wrote:
I'm trying to figure out what is the most pythonic way to interact with a
generator.


JFTR, so you don't think I've suddenly lost interest: I won't be able to
respond for a couple of days because I've just incurred a nice little
hospital session... will be back next week.

--
Thomas
Jul 19 '05 #13
