Bytes IT Community

aligning text with space-normalized text

I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:

py> text = """\
.... aaa bb ccc
.... dd eee. fff gggg
.... hh i.
.... jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
...     print repr(text[s:e])
...
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'
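The invariant connecting the spans to the chunks can be stated mechanically: whitespace-normalizing each recovered substring must reproduce the corresponding chunk. A minimal self-check in modern Python, using made-up data (not the thread's example text, whose exact whitespace isn't recoverable here):

```python
# Hypothetical data: for each span (s, e) found in the original text,
# ' '.join(text[s:e].split()) must give back the matching chunk.
text = "aaa  bb\nccc"
chunks = ["aaa bb", "ccc"]
spans = [(0, 7), (8, 11)]

for (s, e), chunk in zip(spans, chunks):
    # normalize the raw span and compare against the chunk
    assert " ".join(text[s:e].split()) == chunk
```

Any correct solution to the problem has to preserve exactly this property.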

I'm trying to write code to produce the indices. Here's what I have:

py> def get_indices(text, chunks):
...     chunks = iter(chunks)
...     chunk = None
...     for text_index, c in enumerate(text):
...         if c.isspace():
...             continue
...         if chunk is None:
...             chunk = chunks.next().replace(' ', '')
...             chunk_start = text_index
...             chunk_index = 0
...         if c != chunk[chunk_index]:
...             raise Exception('unmatched: %r %r' %
...                             (c, chunk[chunk_index]))
...         else:
...             chunk_index += 1
...             if chunk_index == len(chunk):
...                 yield chunk_start, text_index + 1
...                 chunk = None
...

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True

But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)
Jul 19 '05 #1


Steven Bethard wrote:
[snip]
And it appears to work: [snip]
But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)


Perhaps you should define "work" before you worry about """subjectively
"better" solutions""".

If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']
Jul 19 '05 #2

John Machin wrote:
If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']


This doesn't match the (admittedly vague) spec which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.
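That constraint can be made concrete: every legal chunk is `' '.join()` of a *contiguous run* of `text.split()`, so a glued-together form like 'foobar' can never be produced. A small sketch in modern Python (hypothetical data):

```python
text = "foo  bar"
words = text.split()  # ['foo', 'bar']

# Every legal chunk is ' '.join of a contiguous run of words.
legal_chunks = {" ".join(words[i:j])
                for i in range(len(words))
                for j in range(i + 1, len(words) + 1)}

print(sorted(legal_chunks))      # ['bar', 'foo', 'foo bar']
print("foobar" in legal_chunks)  # False
```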

STeVe
Jul 19 '05 #3

Steven Bethard wrote:
I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some


If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

def main():
    text = """\
aaa bb ccc
dd eee. fff gggg
hh i.
jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                           (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.

Peter

Jul 19 '05 #4

Steven Bethard wrote:
John Machin wrote:
If "work" is meant to detect *all* possibilities of 'chunks' not
having been derived from 'text' in the described manner, then it
doesn't work -- all information about the positions of the whitespace
is thrown away by your code.

For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec


That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.

which said that chunks are created "as if by ' '.join(chunk.split())".
For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.

If it helps, you can re-read my message.

STeVe

Jul 19 '05 #5

John Machin wrote:
Steven Bethard wrote:
John Machin wrote:
For example, text = 'foo bar', chunks = ['foobar']


This doesn't match the (admittedly vague) spec


That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.


Well, the input should never look like the above. But if for some
reason it did, I wouldn't want the error; I'd want the indices. So:
text = 'foo bar'
chunks = ['foobar']
should produce:
[(0, 7)]
not an exception.
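As it happens, the generator from the first post already behaves this way, because it strips spaces out of each chunk before matching character by character. A Python 3 port (the only change is `next(chunks)` in place of `chunks.next()`, plus `ValueError` for the bare `Exception`) run on exactly that input:

```python
def get_indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue                      # whitespace never has to match
        if chunk is None:                 # start matching the next chunk
            chunk = next(chunks).replace(' ', '')
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise ValueError('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):     # chunk fully matched
            yield chunk_start, text_index + 1
            chunk = None

print(list(get_indices('foo bar', ['foobar'])))  # [(0, 7)]
```

Because only the non-space characters of each chunk are compared, 'foobar' matches across the intervening space and yields the full span rather than raising.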

STeVe
Jul 19 '05 #6

Peter Otten wrote:
import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()


Thanks, that's a really nice, clean solution!

STeVe
Jul 19 '05 #7
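For readers on modern Python, the accepted regexp approach carries over with a single change: the built-in `next(lumps)` replaces the Python 2 method call `lumps.next()`. A sketch with hypothetical sample data:

```python
import re

_lump_re = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _lump_re.finditer(text)       # non-whitespace runs, in order
    for chunk in chunks:
        # consume one lump per word of the chunk
        lump = [next(lumps) for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

text = "aaa  bb\nccc dd"
chunks = ["aaa bb", "ccc dd"]
print(list(indices(text, chunks)))  # [(0, 7), (8, 14)]
```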
