Bytes IT Community

aligning text with space-normalized text

I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some
example data to clarify my problem:

py> text = """\
.... aaa bb ccc
.... dd eee. fff gggg
.... hh i.
.... jjj kk.
.... """
py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

Note that the original "text" has a variety of whitespace between words,
but the corresponding "chunks" have only single space characters between
"words". I'm looking for the indices of each chunk, so for this
example, I'd like:

py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

Note that the indices correspond to the *original* text so that the
substrings in the given spans include the irregular whitespace:

py> for s, e in result:
...     print repr(text[s:e])
...
'aaa bb'
'ccc\ndd eee.'
'fff gggg\nhh i.'
'jjj'
'kk.'
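The invariant connecting the spans to the chunks can be stated mechanically: whitespace-normalizing each recovered substring must reproduce the corresponding chunk. A minimal self-check in modern Python, using made-up data (not the thread's example text, whose exact whitespace isn't recoverable here):

```python
# Hypothetical data: for each span (s, e) found in the original text,
# ' '.join(text[s:e].split()) must give back the matching chunk.
text = "aaa  bb\nccc"
chunks = ["aaa bb", "ccc"]
spans = [(0, 7), (8, 11)]

for (s, e), chunk in zip(spans, chunks):
    # normalize the raw span and compare against the chunk
    assert " ".join(text[s:e].split()) == chunk
```

Any correct solution to the problem has to preserve exactly this property.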

I'm trying to write code to produce the indices. Here's what I have:

py> def get_indices(text, chunks):
...     chunks = iter(chunks)
...     chunk = None
...     for text_index, c in enumerate(text):
...         if c.isspace():
...             continue
...         if chunk is None:
...             chunk = chunks.next().replace(' ', '')
...             chunk_start = text_index
...             chunk_index = 0
...         if c != chunk[chunk_index]:
...             raise Exception('unmatched: %r %r' %
...                             (c, chunk[chunk_index]))
...         else:
...             chunk_index += 1
...             if chunk_index == len(chunk):
...                 yield chunk_start, text_index + 1
...                 chunk = None
...

And it appears to work:

py> list(get_indices(text, chunks))
[(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
py> list(get_indices(text, chunks)) == result
True

But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)
Jul 19 '05 #1


Steven Bethard wrote:
[snip]
And it appears to work: [snip]
But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
Pythonic way[1] of writing this code?

Thanks in advance,

STeVe

[1] Yes, I'm aware that these are subjective terms. I'm looking for
subjectively "better" solutions. ;)


Perhaps you should define "work" before you worry about """subjectively
"better" solutions""".

If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']
Jul 19 '05 #2

John Machin wrote:
If "work" is meant to detect *all* possibilities of 'chunks' not having
been derived from 'text' in the described manner, then it doesn't work
-- all information about the positions of the whitespace is thrown away
by your code.

For example, text = 'foo bar', chunks = ['foobar']


This doesn't match the (admittedly vague) spec which said that chunks
are created "as if by ' '.join(chunk.split())". For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.
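That constraint can be made concrete: every legal chunk is `' '.join()` of a *contiguous run* of `text.split()`, so a glued-together form like 'foobar' can never be produced. A small sketch in modern Python (hypothetical data):

```python
text = "foo  bar"
words = text.split()  # ['foo', 'bar']

# Every legal chunk is ' '.join of a contiguous run of words.
legal_chunks = {" ".join(words[i:j])
                for i in range(len(words))
                for j in range(i + 1, len(words) + 1)}

print(sorted(legal_chunks))      # ['bar', 'foo', 'foo bar']
print("foobar" in legal_chunks)  # False
```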

STeVe
Jul 19 '05 #3

Steven Bethard wrote:
I have a string with a bunch of whitespace in it, and a series of chunks
of that string whose indices I need to find. However, the chunks have
been whitespace-normalized, so that multiple spaces and newlines have
been converted to single spaces as if by ' '.join(chunk.split()). Some


If you are willing to get your hands dirty with regexps:

import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

def main():
    text = """\
aaa bb ccc
dd eee. fff gggg
hh i.
jjj kk.
"""
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                           (44, 47), (48, 51)]

if __name__ == "__main__":
    main()

Not tested beyond what you see.

Peter

Jul 19 '05 #4

Steven Bethard wrote:
John Machin wrote:
If "work" is meant to detect *all* possibilities of 'chunks' not
having been derived from 'text' in the described manner, then it
doesn't work -- all information about the positions of the whitespace
is thrown away by your code.

For example, text = 'foo bar', chunks = ['foobar']

This doesn't match the (admittedly vague) spec


That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.

which said that chunks are created "as if by ' '.join(chunk.split())".
For the text:
'foo bar'
the possible chunk lists should be something like:
['foo bar']
['foo', 'bar']
If it helps, you can think of chunks as lists of words, where the words
have been ' '.join()ed.

If it helps, you can re-read my message.

STeVe

Jul 19 '05 #5

John Machin wrote:
Steven Bethard wrote:
John Machin wrote:
For example, text = 'foo bar', chunks = ['foobar']


This doesn't match the (admittedly vague) spec


That is *exactly* my point -- it is not valid input, and you are not
reporting all cases of invalid input; you have an exception where the
non-spaces are impossible, but no exception where whitespaces are
impossible.


Well, the input should never look like the above. But if for some
reason it did, I wouldn't want the error; I'd want the indices. So:
text = 'foo bar'
chunks = ['foobar']
should produce:
[(0, 7)]
not an exception.
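As it happens, the generator from the first post already behaves this way, because it strips spaces out of each chunk before matching character by character. A Python 3 port (the only change is `next(chunks)` in place of `chunks.next()`, plus `ValueError` for the bare `Exception`) run on exactly that input:

```python
def get_indices(text, chunks):
    """Yield (start, end) spans in text for each whitespace-normalized chunk."""
    chunks = iter(chunks)
    chunk = None
    for text_index, c in enumerate(text):
        if c.isspace():
            continue                      # whitespace never has to match
        if chunk is None:                 # start matching the next chunk
            chunk = next(chunks).replace(' ', '')
            chunk_start = text_index
            chunk_index = 0
        if c != chunk[chunk_index]:
            raise ValueError('unmatched: %r %r' % (c, chunk[chunk_index]))
        chunk_index += 1
        if chunk_index == len(chunk):     # chunk fully matched
            yield chunk_start, text_index + 1
            chunk = None

print(list(get_indices('foo bar', ['foobar'])))  # [(0, 7)]
```

Because only the non-space characters of each chunk are compared, 'foobar' matches across the intervening space and yields the full span rather than raising.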

STeVe
Jul 19 '05 #6

Peter Otten wrote:
import re
_reLump = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _reLump.finditer(text)
    for chunk in chunks:
        lump = [lumps.next() for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()


Thanks, that's a really nice, clean solution!

STeVe
Jul 19 '05 #7
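For readers on modern Python, the accepted regexp approach carries over with a single change: the built-in `next(lumps)` replaces the Python 2 method call `lumps.next()`. A sketch with hypothetical sample data:

```python
import re

_lump_re = re.compile(r"\S+")

def indices(text, chunks):
    lumps = _lump_re.finditer(text)       # non-whitespace runs, in order
    for chunk in chunks:
        # consume one lump per word of the chunk
        lump = [next(lumps) for _ in chunk.split()]
        yield lump[0].start(), lump[-1].end()

text = "aaa  bb\nccc dd"
chunks = ["aaa bb", "ccc dd"]
print(list(indices(text, chunks)))  # [(0, 7), (8, 14)]
```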
